Praise for Foundations of Scalable Systems

Building scalable distributed systems is hard. This book just made it easier. With topics ranging from concurrency and load balancing to caching and database scaling, you’ll learn the skills necessary to make your systems scale to meet the demands of today’s modern world.

Mark Richards, Software Architect, Founder of DeveloperToArchitect.com

Through lively examples and a no-nonsense style, Professor Gorton presents and discusses the principles, architectures, and technologies foundational to scalable distributed systems design. This book serves as an essential modern text for students and practitioners alike.

Anna Liu, Senior Manager, Amazon Web Services

The technology in this space is changing all the time, and there is a lot of hype and buzzwords out there. Ian Gorton cuts through that and explains the principles and trade-offs you need to understand to successfully design large-scale software systems.

John Klein, Carnegie Mellon University Software Engineering Institute

Scalability is a serious topic in software design, and this book provides a great overview of the many aspects that need to be considered by architects and software engineers. Ian Gorton succeeds in striking an excellent balance between theory and practice, presenting his real-life experience in a way that is immediately useful. His lighthearted writing style makes for an enjoyable and easy read, with the occasional sidetrack to explain things like the link between software architecture and Italian-inspired cuisine.

Eltjo Poort, Architect, CGI

In the era of cloud computing, scalability is a system characteristic that is easy to take for granted until you find your system hasn’t got it. In this book, Dr. Ian Gorton draws on his wide practical, research, and teaching experience to explain scalability in a very accessible way and provide a thorough introduction to the technologies and techniques that are used to achieve it. It is likely to save its readers from a lot of painful learning experiences when they find that they need to build a highly scalable system!

Dr. Eoin Woods, CTO, Endava

Dealing with issues of distributed systems, microservice architecture, serverless architecture, and distributed databases makes creating a system that can scale to support tens of thousands of users extremely challenging. Ian Gorton has clearly laid out the issues and given a developer the tools they need to contribute to the development of a system that can scale.

Len Bass, Carnegie Mellon University

Trade-offs are key to a distributed system. Professor Gorton puts out great explanations with real-life scenarios for distributed systems and other key related areas, which will help you develop a trade-off mindset for making better decisions.

Vishal Rajpal, Senior Software Development Engineer, Amazon

This is the book to read, whether you’re a distributed systems learner or an experienced software engineer. Dr. Gorton brings together his decades of academic research and cloud industry case studies to equip you with the key knowledge and skills you need to build scalable systems and succeed in the cloud computing era.

Cong Li, Software Engineer, Microsoft

Foundations of Scalable Systems

Designing Distributed Architectures

Ian Gorton

Foundations of Scalable Systems

by Ian Gorton

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Acquisitions Editor: Melissa Duffield
  • Development Editor: Virginia Wilson
  • Production Editor: Jonathon Owen
  • Copyeditor: Justin Billing
  • Proofreader: nSight, Inc.
  • Indexer: nSight, Inc.
  • Interior Designer: David Futato
  • Cover Designer: Karen Montgomery
  • Illustrator: Kate Dullea
  • July 2022: First Edition

Revision History for the First Edition

  • 2022-06-29: First Release

See https://oreil.ly/scal-sys for release details.

Preface

This book is built around the thesis that the ability of software systems to operate at scale is increasingly a key factor that defines success. As our world becomes more interconnected, this characteristic will only become more prevalent. Hence, the goal of this book is to provide the reader with the core knowledge of distributed and concurrent systems. It also introduces a collection of software architecture approaches and distributed technologies that can be used to build scalable systems.

Why Scalability?

The pace of change in our world is daunting. Innovations appear daily, creating new capabilities for us all to interact, conduct business, entertain ourselves…even end pandemics. The fuel for much of this innovation is software, written by veritable armies of developers in major internet companies, crack small teams in startups, and all shapes and sizes of teams in between.

Delivering software systems that are responsive to user needs is difficult enough, but it becomes an order of magnitude more difficult to do for systems at scale. We all know of systems that fail suddenly when exposed to unexpected high loads—such situations are (in the best cases) bad publicity for organizations, and at worst can result in lost jobs or destroyed companies.

Software is unlike physical systems in that it’s amorphous—its physical form (1s and 0s) bears no resemblance to its actual capabilities. We’d never expect to transform a small village of 500 people into a city of 10 million overnight. But we sometimes expect our software systems to suddenly handle one thousand times the number of requests they were designed for. Unsurprisingly, the outcomes are rarely pretty.

Who This Book Is For

The major target audience for this book is software engineers and architects who have zero or limited experience with distributed, concurrent systems. They need to deepen both their theoretical and practical design knowledge in order to meet the challenges of building larger-scale, typically internet-facing applications.

What You Will Learn

This book covers the landscape of concurrent and distributed systems through the lens of scalability. While it’s impossible to totally divorce scalability from other architectural qualities, scalability is the main focus of discussion. Of course, other qualities necessarily come into play, with performance, availability, and consistency regularly raising their heads.

Building distributed systems requires some fundamental understanding of distribution and concurrency—this knowledge is a recurrent theme throughout this book. It’s needed because of the two essential problems in distributed systems that make them complex, as I describe below.

First, although systems as a whole operate perfectly correctly nearly all the time, an individual part of the system may fail at any time. When a component fails (whether due to a hardware crash, network outage, bug in a server, etc.), we have to employ techniques that enable the system as a whole to continue operations and recover from failures. Every distributed system will experience component failure, often in weird, mysterious, and unanticipated ways.

Second, creating a scalable distributed system requires the coordination of multiple moving parts. Each component of the system needs to keep its part of the bargain and process requests as quickly as possible. If just one component causes requests to be delayed, the whole system may perform poorly and even eventually crash.

There is a rich body of literature available to help you deal with these problems. Luckily for us engineers, there’s also an extensive collection of technologies that are designed to help us build distributed systems that are tolerant to failure and scalable. These technologies embody theoretical approaches and complex algorithms that are incredibly hard to build correctly. Using these platform-level, widely applicable technologies, our applications can stand on the shoulders of giants, enabling us to build sophisticated business solutions.

Specifically, readers of this book will learn:

  • The fundamental characteristics of distributed systems, including state management, time coordination, concurrency, communications, and coordination

  • Architectural approaches and supporting technologies for building scalable, robust services

  • How distributed databases operate and can be used to build scalable distributed systems

  • Architectures and technologies such as Apache Kafka and Flink for building streaming, event-based systems

Note for Educators

Much of the content of this book has been developed in the context of an advanced undergraduate/graduate course at Northeastern University. It has proven a very popular and effective approach for equipping students with the knowledge and skills needed to launch their careers with major internet companies. Additional materials on the book website are available to support educators who wish to use the book for their course.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Note

This element signifies a general note.

Warning

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/fss-git-repo.

If you have a technical question or a problem using the code examples, please send email to .

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Foundations of Scalable Systems by Ian Gorton (O’Reilly). Copyright 2022 Ian Gorton, 978-1-098-10606-5.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

O’Reilly Online Learning

Note

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/scal-sys.

Email to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media

Follow us on Twitter: https://twitter.com/oreillymedia

Watch us on YouTube: https://www.youtube.com/oreillymedia

Acknowledgments

None of this work would ever have happened without the inspiration afforded to me by my graduate school advisor, Professor Jon Kerridge. His boundless enthusiasm has fueled me in this work for three decades.

Matt Bass and John Klein from Carnegie Mellon University were invaluable resources in the early stages of this project. I thank them for the great discussions about the whole spectrum of scalable software architectures.

My reviewers have been excellent—diligent and insightful—and have kept me on the right track. Eternal gratitude is due to Mark Richards, Matt Stine, Thiyagu Palanisamy, Jess Males, Orkhan Huseynli, Adnan Rashid, and Nirav Aga. And many thanks to Virginia Wilson for fixing my wonky words!

I’d also like to thank all my students, and especially Ruijie Xiao, in the CS6650 Building Scalable Distributed Systems course at Northeastern University in Seattle. You’ve provided me with invaluable feedback on how best to communicate the many complex concepts covered in this book. You are the best guinea pigs ever!

Part I. The Basics

The first four chapters in Part I of this book advocate the need for scalability as a key architectural attribute in modern software systems. These chapters provide broad coverage of the basic mechanisms for achieving scalability, the fundamental characteristics of distributed systems, and an introduction to concurrent programming. This knowledge lays the foundation for what follows, and if you are new to the areas of distributed, concurrent systems, you’ll need to spend some time on these chapters. They will make the rest of the book much easier to digest.

Chapter 1. Introduction to Scalable Systems

The last 20 years have seen unprecedented growth in the size, complexity, and capacity of software systems. This rate of growth is hardly likely to slow in the next 20 years—what future systems will look like is close to unimaginable right now. However, one thing we can guarantee is that more and more software systems will need to be built with constant growth—more requests, more data, and more analysis—as a primary design driver.

Scalable is the term used in software engineering to describe software systems that can accommodate growth. In this chapter I’ll explore what precisely is meant by the ability to scale, known (not surprisingly) as scalability. I’ll also describe a few examples that put hard numbers on the capabilities and characteristics of contemporary applications and give a brief history of the origins of the massive systems we routinely build today. Finally, I’ll describe two general principles for achieving scalability, replication and optimization, which will recur in various forms throughout the rest of this book, and examine the indelible link between scalability and other software architecture quality attributes.

What Is Scalability?

Intuitively, scalability is a pretty straightforward concept. If we ask Wikipedia for a definition, it tells us, “Scalability is the property of a system to handle a growing amount of work by adding resources to the system.” We all know how we scale a highway system—we add more traffic lanes so it can handle a greater number of vehicles. Some of my favorite people know how to scale beer production—they add more capacity in terms of the number and size of brewing vessels, the number of staff to perform and manage the brewing process, and the number of kegs they can fill with fresh, tasty brews. Think of any physical system—a transit system, an airport, elevators in a building—and how we increase capacity is pretty obvious.

Unlike physical systems, software systems are somewhat amorphous. They are not something you can point at, see, touch, feel, and get a sense of how it behaves internally from external observation. A software system is a digital artifact. At its core, the stream of 1s and 0s that make up executable code and data are hard for anyone to tell apart. So, what does scalability mean in terms of a software system?

Put very simply, and without getting into definition wars, scalability defines a software system’s capability to handle growth in some dimension of its operations. Examples of operational dimensions are:

  • The number of simultaneous user or external (e.g., sensor) requests a system can process

  • The amount of data a system can effectively process and manage

  • The value that can be derived from the data a system stores through predictive analytics

  • The ability to maintain a stable, consistent response time as the request load grows

For example, imagine a major supermarket chain is rapidly opening new stores and increasing the number of self-checkout kiosks in every store. This requires the core supermarket software systems to perform the following functions:

  • Handle increased volume from item scanning without decreased response time. Instantaneous responses to item scans are necessary to keep customers happy.

  • Process and store the greater data volumes generated from increased sales. This data is needed for inventory management, accounting, planning, and likely many other functions.

  • Derive “real-time” (e.g., hourly) sales data summaries from each store, region, and country and compare to historical trends. This trend data can help highlight unusual events in regions (unexpected weather conditions, large crowds at events, etc.) and help affected stores to quickly respond.

  • Evolve the stock ordering prediction subsystem to be able to correctly anticipate sales (and hence the need for stock reordering) as the number of stores and customers grow.

These dimensions are effectively the scalability requirements of the system. If, over a year, the supermarket chain opens 100 new stores and grows sales by 400 times (some of the new stores are big!), then the software system needs to scale to provide the necessary processing capacity to enable the supermarket to operate efficiently. If the systems don’t scale, we could lose sales when customers become unhappy. We might hold stock that will not be sold quickly, increasing costs. We might miss opportunities to increase sales by responding to local circumstances with special offerings. All these factors reduce customer satisfaction and profits. None are good for business.

Successfully scaling is therefore crucial for our imaginary supermarket’s business growth, and likewise is in fact the lifeblood of many modern internet applications. But for most business and government systems, scalability is not a primary quality requirement in the early stages of development and deployment. New features to enhance usability and utility become the drivers of our development cycles. As long as performance is adequate under normal loads, we keep adding user-facing features to enhance the system’s business value. In fact, introducing some of the sophisticated distributed technologies I’ll describe in this book before there is a clear requirement can actually be deleterious to a project, with the additional complexity causing development inertia.

Still, it’s not uncommon for systems to evolve into a state where enhanced performance and scalability become a matter of urgency, or even survival. Attractive features and high utility breed success, which brings more requests to handle and more data to manage. This often heralds a tipping point, wherein design decisions that made sense under light loads suddenly become technical debt.1 External trigger events often cause these tipping points: look in the March/April 2020 media for the many reports of government unemployment and supermarket online ordering sites crashing under demand caused by the coronavirus pandemic.

Increasing a system’s capacity in some dimension by increasing resources is called scaling up or scaling out—I’ll explore the difference between these later. In addition, unlike physical systems, it is often equally important to be able to scale down the capacity of a system to reduce costs.

The canonical example of this is Netflix, which has a predictable regional diurnal load that it needs to process. Simply, a lot more people are watching Netflix in any geographical region at 9 p.m. than are at 5 a.m. This enables Netflix to reduce its processing resources during times of lower load. This saves the cost of running the processing nodes that are used in the Amazon cloud, as well as societally worthy things such as reducing data center power consumption. Compare this to a highway. At night when few cars are on the road, we don’t retract lanes (except to make repairs). The full road capacity is available for the few drivers to go as fast as they like. In software systems, we can expand and contract our processing capacity in a matter of seconds to meet instantaneous load. Compared to physical systems, the strategies we deploy are vastly different.
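
To make the elasticity idea concrete, here is a minimal sketch of a time-based scaling policy like the diurnal pattern just described. It is purely illustrative: real deployments delegate this decision to their cloud provider's autoscaling services, and the hours and instance counts here are hypothetical.

    // A toy diurnal scaling policy; thresholds and counts are invented.
    import java.time.LocalTime;

    public class DiurnalScalingPolicy {
        // Returns how many server instances to run at a given local time.
        static int desiredInstances(LocalTime now) {
            int hour = now.getHour();
            if (hour >= 18) return 100; // evening peak: maximum capacity
            if (hour >= 7) return 40;   // daytime: moderate capacity
            return 10;                  // overnight: scale down to save costs
        }

        public static void main(String[] args) {
            System.out.println("Instances needed now: "
                + desiredInstances(LocalTime.now()));
        }
    }

Scaling down matters as much as scaling up here: the overnight setting is what saves the money and power described above.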

There’s a lot more to consider about scalability in software systems, but let’s come back to these issues after examining the scale of some contemporary software systems circa 2021.

Examples of System Scale in the Early 2000s

Looking ahead in this technology game is always fraught with danger. In 2008 I wrote:

“While petabyte datasets and gigabit data streams are today’s frontiers for data-intensive applications, no doubt 10 years from now we’ll fondly reminisce about problems of this scale and be worrying about the difficulties that looming exascale applications are posing.”2

Reasonable sentiments, it is true, but exascale? That’s almost commonplace in today’s world. Google reported multiple exabytes of Gmail in 2014, and by now, do all Google services manage a yottabyte or more? I don’t know. I’m not even sure I know what a yottabyte is! Google won’t tell us about their storage, but I wouldn’t bet against it. Similarly, how much data does Amazon store in the various AWS data stores for their clients? And how many requests does, say, DynamoDB process per second, collectively, for all supported client applications? Think about these things for too long and your head will explode.

A great source of information that sometimes gives insights into contemporary operational scales are the major internet companies’ technical blogs. There are also websites analyzing internet traffic that are highly illustrative of traffic volumes. Let’s take a couple of point-in-time examples to illustrate a few things we do know today. Bear in mind these will look almost quaint in a year or four:

  • Facebook’s engineering blog describes Scribe, their solution for collecting, aggregating, and delivering petabytes of log data per hour, with low latency and high throughput. Facebook’s computing infrastructure comprises millions of machines, each of which generates log files that capture important events relating to system and application health. Processing these log files, for example from a web server, can give development teams insights into their application’s behavior and performance, and support faultfinding. Scribe is a custom buffered queuing solution that can transport logs from servers at a rate of several terabytes per second and deliver them to downstream analysis and data warehousing systems. That, my friends, is a lot of data!

  • You can see live internet traffic for numerous services at Internet Live Stats. Dig around and you’ll find some staggering statistics; for example, Google handles around 3.5 billion search requests per day, Instagram users upload about 65 million photos per day, and there are something like 1.7 billion websites. It is a fun site with lots of information. Note that the data is not real, but rather estimates based on statistical analyses of multiple data sources.

  • In 2016, Google published a paper describing the characteristics of its codebase. Among the many startling facts reported is the fact that “The repository contains 86 TBs of data, including approximately two billion lines of code in nine million unique source files.” Remember, this was 2016.3

Still, real, concrete data on the scale of the services provided by major internet sites remain shrouded in commercial-in-confidence secrecy. Luckily, we can get some deep insights into the request and data volumes handled at internet scale through the annual usage report from one tech company. Beware though, as it is from Pornhub.4 You can browse their incredibly detailed usage statistics from 2019 here. It’s a fascinating glimpse into the capabilities of massive-scale systems.

How Did We Get Here? A Brief History of System Growth

I am sure many readers will have trouble believing there was civilized life before internet searching, YouTube, and social media. In fact, the first video upload to YouTube occurred in 2005. Yep, it is hard even for me to believe. So, let’s take a brief look back in time at how we arrived at the scale of today’s systems. Below are some historical milestones of note:

1980s
An age dominated by time-shared mainframes and minicomputers. PCs emerged in the early 1980s but were rarely networked. By the end of the 1980s, development labs, universities, and (increasingly) businesses had email and access to primitive internet resources.
1990–95
Networks became more pervasive, creating an environment ripe for the creation of the World Wide Web (WWW) with HTTP/HTML technology that had been pioneered at CERN by Tim Berners-Lee during the 1980s. By 1995, the number of websites was tiny, but the seeds of the future were planted with companies like Yahoo! in 1994 and Amazon and eBay in 1995.
1996–2000
The number of websites grew from around 10,000 to 10 million, a truly explosive growth period. Networking bandwidth and access also grew rapidly. Companies like Amazon, eBay, Google, and Yahoo! were pioneering many of the design principles and early versions of advanced technologies for highly scalable systems that we know and use today. Everyday businesses rushed to exploit the new opportunities that e-business offered, and this brought system scalability to prominence, as explained in the sidebar “How Scale Impacted Business Systems”.
2000–2006
The number of websites grew from around 10 million to 80 million during this period, and new service and business models emerged. In 2005, YouTube was launched. 2006 saw Facebook become available to the public. In the same year, Amazon Web Services (AWS), which had low-key beginnings in 2004, relaunched with its S3 and EC2 services.
2007–today
We now live in a world with around 2 billion websites, of which about 20% are active. There are something like 4 billion internet users. Huge data centers operated by public cloud operators like AWS, Google Cloud Platform (GCP), and Microsoft Azure, along with a myriad of private data centers, for example, Twitter’s operational infrastructure, are scattered around the planet. Clouds host millions of applications, with engineers provisioning and operating their computational and data storage systems using sophisticated cloud management portals. Powerful cloud services make it possible for us to build, deploy, and scale our systems literally with a few clicks of a mouse. All companies have to do is pay their cloud provider bill at the end of the month.

This is the world that this book targets. A world where our applications need to exploit the key principles for building scalable systems and leverage highly scalable infrastructure platforms. Bear in mind, in modern applications, most of the code executed is not written by your organization. It is part of the containers, databases, messaging systems, and other components that you compose into your application through API calls and build directives. This makes the selection and use of these components at least as important as the design and development of your own business logic. They are architectural decisions that are not easy to change.

Scalability Basic Design Principles

The basic aim of scaling a system is to increase its capacity in some application-specific dimension. A common dimension is increasing the number of requests that a system can process in a given time period. This is known as the system’s throughput. Let’s use an analogy to explore two basic principles we have available to us for scaling our systems and increasing throughput: replication and optimization.

In 1932, one of the world’s iconic wonders of engineering, the Sydney Harbour Bridge, was opened. Now, it is a fairly safe assumption that traffic volumes in 2021 are somewhat higher than in 1932. If by any chance you have driven over the bridge at peak hour in the last 30 years, then you know that its capacity is exceeded considerably every day. So how do we increase throughput on physical infrastructures such as bridges?

This issue became very prominent in Sydney in the 1980s, when it was realized that the capacity of the harbor crossing had to be increased. The solution was the rather less iconic Sydney Harbour Tunnel, which essentially follows the same route underneath the harbor. This provides four additional lanes of traffic and hence added roughly one-third more capacity to harbor crossings. In not-too-far-away Auckland, their harbor bridge also had a capacity problem as it was built in 1959 with only four lanes. In essence, they adopted the same solution as Sydney, namely, to increase capacity. But rather than build a tunnel, they ingeniously doubled the number of lanes by expanding the bridge with the hilariously named “Nippon clip-ons”, which widened the bridge on each side.

These examples illustrate the first strategy we have in software systems to increase capacity. We basically replicate the software processing resources to provide more capacity to handle requests and thus increase throughput, as shown in Figure 1-1. These replicated processing resources are analogous to the traffic lanes on bridges, providing a mostly independent processing pathway for a stream of arriving requests.

Luckily, in cloud-based software systems, replication can be achieved at the click of a mouse, and we can effectively replicate our processing resources thousands of times. We have it a lot easier than bridge builders in that respect. Still, we need to take care to replicate resources in order to alleviate real bottlenecks. Adding capacity to processing paths that are not overwhelmed will add needless costs without providing scalability benefit.

Figure 1-1. Increasing capacity through replication
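
To see the replication principle of Figure 1-1 in miniature, the sketch below stands in for replicated service instances with a fixed pool of worker threads, each independently processing requests from an arriving stream. The request handler and its 100 ms service time are hypothetical.

    // A toy illustration of replicated processing capacity: a fixed pool of
    // "instances" (threads) independently serves a stream of requests.
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ReplicationSketch {
        public static void main(String[] args) throws InterruptedException {
            int instances = 4; // analogous to the number of traffic lanes
            ExecutorService pool = Executors.newFixedThreadPool(instances);

            for (int request = 1; request <= 20; request++) {
                final int id = request;
                pool.submit(() -> handle(id));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }

        // Hypothetical request handler; the sleep stands in for real work.
        static void handle(int id) {
            try {
                Thread.sleep(100);
                System.out.println(Thread.currentThread().getName()
                    + " processed request " + id);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

Doubling the pool size roughly doubles throughput, just as adding lanes does on a bridge, provided the pool really is the bottleneck.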

The second strategy for scalability can also be illustrated with our bridge example. In Sydney, some observant person realized that in the mornings a lot more vehicles cross the bridge from north to south, and in the afternoon we see the reverse pattern. A smart solution was therefore devised—allocate more of the lanes to the high-demand direction in the morning, and sometime in the afternoon, switch this around. This effectively increased the capacity of the bridge without allocating any new resources—we optimized the resources we already had available.

We can follow this same approach in software to scale our systems. If we can somehow optimize our processing by using more efficient algorithms, adding extra indexes in our databases to speed up queries, or even rewriting our server in a faster programming language, we can increase our capacity without increasing our resources. The canonical example of this is Facebook’s creation of (the now discontinued) HipHop for PHP, which increased the speed of Facebook’s web page generation by up to six times by compiling PHP code to C++.
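
As a minimal illustration of the optimization principle, the sketch below (all names invented) replaces a linear scan with a hash-based lookup, much as a database index accelerates queries. Capacity increases without adding any resources.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LookupOptimization {
        record Product(String sku, String name) {}

        // Unoptimized: an O(n) scan of the product list on every request.
        static Product findBySkuSlow(List<Product> products, String sku) {
            for (Product p : products) {
                if (p.sku().equals(sku)) return p;
            }
            return null;
        }

        // Optimized: build the index once, then answer each request in O(1),
        // much as a database index accelerates queries.
        static Map<String, Product> buildIndex(List<Product> products) {
            Map<String, Product> index = new HashMap<>();
            for (Product p : products) {
                index.put(p.sku(), p);
            }
            return index;
        }
    }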

I’ll revisit these two design principles—namely replication and optimization—throughout this book. You will see that there are many complex implications of adopting these principles, arising from the fact that we are building distributed systems. Distributed systems have properties that make building scalable systems interesting, which in this context has both positive and negative connotations.

Scalability and Costs

Let’s take a trivial hypothetical example to examine the relationship between scalability and costs. Assume we have a web-based (e.g., web server and database) system that can service a load of 100 concurrent requests with a mean response time of 1 second. We get a business requirement to scale up this system to handle 1,000 concurrent requests with the same response time. Without making any changes, a simple load test of this system reveals the performance shown in Figure 1-2 (left). As the request load increases, we see the mean response time steadily grow to 10 seconds with the projected load. Clearly this does not satisfy our requirements in its current deployment configuration. The system doesn’t scale.

Figure 1-2. Scaling an application; the left shows nonscalable performance, the right shows scalable performance

Some engineering effort is needed in order to achieve the required performance. Figure 1-2 (right) shows the system’s performance after the system has been modified. It now provides the specified response time with 1,000 concurrent requests. And so, we have successfully scaled the system. Party time!
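
A load test of the kind behind Figure 1-2 can be as simple as the following sketch: issue N concurrent requests and report the mean response time. The request itself is a hypothetical stand-in for a call to the system under test.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class SimpleLoadTest {
        public static void main(String[] args) throws Exception {
            int concurrentRequests = 1000; // the projected load
            ExecutorService pool = Executors.newFixedThreadPool(concurrentRequests);
            List<Future<Long>> results = new ArrayList<>();

            for (int i = 0; i < concurrentRequests; i++) {
                results.add(pool.submit(() -> {
                    long start = System.nanoTime();
                    issueRequest(); // hypothetical call to the system under test
                    return System.nanoTime() - start;
                }));
            }

            long totalNanos = 0;
            for (Future<Long> f : results) {
                totalNanos += f.get();
            }
            pool.shutdown();
            System.out.printf("Mean response time: %.1f ms%n",
                totalNanos / (double) concurrentRequests / 1_000_000);
        }

        // Stand-in for a real request (e.g., an HTTP call to the web server).
        static void issueRequest() throws InterruptedException {
            Thread.sleep(50); // simulate service time
        }
    }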

A major question looms, however. Namely, how much effort and resources were required to achieve this performance? Perhaps it was simply a case of running the web server on a more powerful (virtual) machine. Performing such reprovisioning on a cloud might take 30 minutes at most. Slightly more complex would be reconfiguring the system to run multiple instances of the web server to increase capacity. Again, this should be a simple, low-cost configuration change for the application, with no code changes needed. These would be excellent outcomes.

However, scaling a system isn’t always so easy. The reasons for this are many and varied, but here are some possibilities:

  • The database becomes less responsive with 1,000 requests per second, requiring an upgrade to a new machine.

  • The web server generates a lot of content dynamically and this reduces response time under load. A possible solution is to alter the code to more efficiently generate the content, thus reducing processing time per request.

  • The request load creates hotspots in the database when many requests try to access and update the same records simultaneously. This requires a schema redesign and subsequent reloading of the database, as well as code changes to the data access layer.

  • The web server framework that was selected emphasized ease of development over scalability. The model it enforces means that the code simply cannot be scaled to meet the requested load requirements, and a complete rewrite is required. Use another framework? Use another programming language even?

There’s a myriad of other potential causes, but hopefully these illustrate the increasing effort that might be required as we move from possibility (1) to possibility (4).

Now let’s assume option (1), upgrading the database server, requires 15 hours of effort and a thousand dollars in extra cloud costs per month for a more powerful server. This is not prohibitively expensive. And let’s assume option (4), a rewrite of the web application layer, requires 10,000 hours of development due to implementing a new language (e.g., Java instead of Ruby). Options (2) and (3) fall somewhere in between options (1) and (4). The cost of 10,000 hours of development is seriously significant. Even worse, while the development is underway, the application may be losing market share and hence money due to its inability to satisfy client requests’ loads. These kinds of situations can cause systems and businesses to fail.
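
To put rough numbers on this comparison, assume a hypothetical fully loaded engineering rate of $100 per hour:

    Option (1): 15 h × $100/h + 12 × $1,000/month ≈ $13,500 for the first year
    Option (4): 10,000 h × $100/h = $1,000,000, before any lost revenue

That is roughly a 75x difference between the two extremes.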

This simple scenario illustrates how the dimensions of resource and effort costs are inextricably tied to scalability. If a system is not designed intrinsically to scale, then the downstream costs and resources of increasing its capacity to meet requirements may be massive. For some applications, such as HealthCare.gov, these (more than $2 billion) costs are borne and the system is modified to eventually meet business needs. For others, such as Oregon’s health care exchange, an inability to scale rapidly at low cost can be an expensive ($303 million, in Oregon’s case) death knell.

We would never expect someone would attempt to scale up the capacity of a suburban home to become a 50-floor office building. The home doesn’t have the architecture, materials, and foundations for this to be even a remote possibility without being completely demolished and rebuilt. Similarly, we shouldn’t expect software systems that do not employ scalable architectures, mechanisms, and technologies to be quickly evolved to meet greater capacity needs. The foundations of scale need to be built in from the beginning, with the recognition that the components will evolve over time. By employing design and development principles that promote scalability, we can more rapidly and cheaply scale up systems to meet rapidly growing demands. I’ll explain these principles in Part II of this book.

Software systems that can be scaled exponentially while costs grow linearly are known as hyperscale systems, which I define as follows: “Hyper scalable systems exhibit exponential growth in computational and storage capabilities while exhibiting linear growth rates in the costs of resources required to build, operate, support, and evolve the required software and hardware resources.” You can read more about hyperscale systems in this article.

Scalability and Architecture Trade-Offs

Scalability is just one of the many quality attributes, or nonfunctional requirements, that are the lingua franca of the discipline of software architecture. One of the enduring complexities of software architecture is the necessity of quality attribute trade-offs. Basically, a design that favors one quality attribute may negatively or positively affect others. For example, we may want to write log messages when certain events occur in our services so we can do forensics and support debugging of our code. We need to be careful, however, how many events we capture, because logging introduces overheads and negatively affects performance and cost.

Experienced software architects constantly tread a fine line, crafting their designs to satisfy high-priority quality attributes, while minimizing the negative effects on other quality attributes.

Scalability is no different. When we point the spotlight at the ability of a system to scale, we have to carefully consider how our design influences other highly desirable properties such as performance, availability, security, and the oft overlooked manageability. I’ll briefly discuss some of these inherent trade-offs in the following sections.

Performance

There’s a simple way to think about the difference between performance and scalability. When we target performance, we attempt to satisfy some desired metrics for individual requests. This might be a mean response time of less than 2 seconds, or a worst-case performance target such as the 99th percentile response time less than 3 seconds.
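
As a concrete illustration, the following sketch computes the mean and the 99th percentile from a collection of measured response times; the sample values are invented.

    import java.util.Arrays;

    public class LatencyStats {
        public static void main(String[] args) {
            // Hypothetical response times in milliseconds.
            double[] latencies = {120, 95, 180, 2400, 110, 130, 105, 98, 150, 115};
            Arrays.sort(latencies);

            double mean = Arrays.stream(latencies).average().orElse(0);
            // Index of the value below which 99% of observations fall.
            int p99Index = (int) Math.ceil(0.99 * latencies.length) - 1;

            System.out.printf("Mean: %.1f ms, p99: %.1f ms%n",
                mean, latencies[p99Index]);
        }
    }

Note how a single slow outlier dominates the p99 value while barely moving the mean, which is exactly why worst-case targets are expressed as percentiles.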

Improving performance is in general a good thing for scalability. If we improve the performance of individual requests, we create more capacity in our system, which helps us with scalability as we can use the unused capacity to process more requests.

However, it’s not always that simple. We may reduce response times in a number of ways. We might carefully optimize our code by, for example, removing unnecessary object copying, using a faster JSON serialization library, or even completely rewriting code in a faster programming language. These approaches optimize performance without increasing resource usage.

An alternative approach might be to optimize individual requests by keeping commonly accessed state in memory rather than writing to the database on each request. Eliminating a database access nearly always speeds things up. However, if our system maintains large amounts of state in memory for prolonged periods, we may (and in a heavily loaded system, will) have to carefully manage the number of requests our system can handle. This will likely reduce scalability as our optimization approach for individual requests uses more resources (in this case, memory) than the original solution, and thus reduces system capacity.
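
A minimal sketch of this trade-off, with a hypothetical fetchFromDatabase call, might look like the following. The cache removes a database round trip on repeated requests, at the price of holding state in memory.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ProfileService {
        // In-memory state: faster requests, but memory consumed per entry.
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        String getProfile(String userId) {
            // Only goes to the database on a cache miss.
            return cache.computeIfAbsent(userId, this::fetchFromDatabase);
        }

        // Hypothetical, comparatively slow database read.
        private String fetchFromDatabase(String userId) {
            return "profile-for-" + userId;
        }
    }

An unbounded map like this is exactly the scalability risk described above; production caches bound their size and evict entries.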

We’ll see this tension between performance and scalability reappear throughout this book. In fact, it’s sometimes judicious to make individual requests slightly slower so we can utilize additional system capacity. A great example of this is described when I discuss load balancing in the next chapter.

Availability

Availability and scalability are in general highly compatible partners. As we scale our systems through replicating resources, we create multiple instances of services that can be used to handle requests from any users. If one of our instances fails, the others remain available. The system just suffers from reduced capacity due to a failed, unavailable resource. Similar thinking holds for replicating network links, network routers, disks, and pretty much any resource in a computing system.

Things get complicated with scalability and availability when state is involved. Think of a database. If our single database server becomes overloaded, we can replicate it and send requests to either instance. This also increases availability as we can tolerate the failure of one instance. This scheme works great if our databases are read only. But as soon as we update one instance, we somehow have to figure out how and when to update the other instance. This is where the issue of replica consistency raises its ugly head.

In fact, whenever state is replicated for scalability and availability, we have to deal with consistency. This will be a major topic when I discuss distributed databases in Part III of this book.

Security

Security is a complex, highly technical topic worthy of its own book. No one wants to use an insecure system, and systems that are hacked and compromise user data cause CTOs to resign, and in extreme cases, companies to fail.

The basic elements of a secure system are authentication, authorization, and integrity. We need to ensure data cannot be intercepted in transit over networks, and data at rest (persistent store) cannot be accessed by anyone who does not have permission to access that data. Basically, I don’t want anyone seeing my credit card number as it is communicated between systems or stored in a company’s database.

Hence, security is a necessary quality attribute for any internet-facing system. The costs of building secure systems cannot be avoided, so let’s briefly examine how these affect performance and scalability.

At the network level, systems routinely exploit the Transport Layer Security (TLS) protocol, which runs on top of TCP/IP (see Chapter 3). TLS provides encryption, authentication, and integrity using asymmetric cryptography. This has a performance cost for establishing a secure connection as both parties need to generate and exchange keys. TLS connection establishment also includes an exchange of certificates to verify the identity of the server (and optionally client), and the selection of an algorithm to check that the data is not tampered with in transit. Once a connection is established, in-flight data is encrypted using symmetric cryptography, which has a negligible performance penalty as modern CPUs have dedicated encryption hardware. Connection establishment usually requires two message exchanges between client and server, and is thus comparatively slow. Reusing connections as much as possible minimizes these performance overheads.
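
Here’s a minimal sketch of that connection-reuse advice, using Java’s built-in java.net.http.HttpClient (Java 11 and later), which pools and keeps TLS connections alive across requests; the endpoint URL is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TlsConnectionReuse {
    // One shared client: it keeps TLS connections alive and reuses them,
    // so the expensive handshake happens once per server, not per request.
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; substitute any HTTPS service you operate.
        URI uri = URI.create("https://api.example.com/balance");
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();

        // Sequential requests reuse the already-established TLS connection.
        for (int i = 0; i < 3; i++) {
            HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Status: " + response.statusCode());
        }
    }
}
```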

There are multiple options for protecting data at rest. Popular database engines such as SQL Server and Oracle have features such as transparent data encryption (TDE) that provides efficient file-level encryption. Finer-grain encryption mechanisms, down to field level, are increasingly required in regulated industries such as finance. Cloud providers offer various features too, ensuring data stored in cloud-based data stores is secure. The overheads of secure data at rest are simply costs that must be borne to achieve security—studies suggest the overheads are in the 5–10% range.

Another perspective on security is the CIA triad, which stands for confidentiality, integrity, and availability. The first two are pretty much what I have described above. Availability refers to a system’s ability to operate reliably under attack from adversaries. Such attacks might be attempts to exploit a system design weakness to bring the system down. Another attack is the classic distributed denial-of-service (DDoS), in which an adversary gains control over multitudes of systems and devices and coordinates a flood of requests that effectively make a system unavailable.

In general, security and scalability are opposing forces. Security necessarily introduces performance degradation: the more layers of security a system encompasses, the greater the burden placed on performance, and hence scalability. This eventually affects the bottom line—more powerful and expensive resources are required to achieve a system’s performance and scalability requirements.

Manageability

As the systems we build become more distributed and complex in their interactions, their management and operations come to the fore. We need to pay attention to ensuring every component is operating as expected and that performance continues to meet expectations.

The platforms and technologies we use to build our systems provide a multitude of standards-based and proprietary monitoring tools that can be used for these purposes. Monitoring dashboards can be used to check the ongoing health and behavior of each system component. These dashboards, built using highly customizable and open tools such as Grafana, can display system metrics and send alerts when various thresholds or events occur that need operator attention. The term used for this sophisticated monitoring capability is observability.

There are various APIs such as Java’s MBeans, AWS CloudWatch and Python’s AppMetrics that engineers can utilize to capture custom metrics for their systems—a typical example is request response times. Using these APIs, monitoring dashboards can be tailored to provide live charts and graphs that give deep insights into a system’s behavior. Such insights are invaluable to ensure ongoing operations and highlight parts of the system that may need optimization or replication.
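
As an illustrative sketch of the JMX MBeans approach, the following registers a hypothetical RequestStats metric whose attributes a JMX console or dashboard agent could poll; the object name and fields are invented for the example.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class MetricsExample {
    // The management interface: each getter becomes a metric that JMX
    // consoles and monitoring agents can read.
    public interface RequestStatsMBean {
        long getRequestCount();
        double getAverageResponseTimeMs();
    }

    public static class RequestStats implements RequestStatsMBean {
        private final AtomicLong count = new AtomicLong();
        private final AtomicLong totalMs = new AtomicLong();

        // Call from request-handling code when each request completes.
        public void record(long elapsedMs) {
            count.incrementAndGet();
            totalMs.addAndGet(elapsedMs);
        }

        public long getRequestCount() { return count.get(); }

        public double getAverageResponseTimeMs() {
            long n = count.get();
            return n == 0 ? 0.0 : (double) totalMs.get() / n;
        }
    }

    public static void main(String[] args) throws Exception {
        RequestStats stats = new RequestStats();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // The domain and name here are hypothetical.
        server.registerMBean(stats, new ObjectName("igbank:type=RequestStats"));
        stats.record(42);  // simulate one request that took 42 ms
        System.out.println(stats.getAverageResponseTimeMs() + " ms average");
    }
}
```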

Scaling a system invariably means adding new system components—hardware and software. As the number of components grows, we have more moving parts to monitor and manage. This is never effort-free. It adds complexity to the operations of the system, as well as costs for developing the monitoring code and evolving the observability platform.

The only way to control the costs and complexity of manageability as we scale is through automation. This is where the world of DevOps enters the scene. DevOps is a set of practices and tooling that combine software development and system operations. DevOps reduces the development lifecycle for new features and automates ongoing test, deployment, management, upgrade, and monitoring of the system. It’s an integral part of any successful scalable system.

Summary and Further Reading

The ability to scale an application quickly and cost-effectively should be a defining quality of the software architecture of contemporary internet-facing applications. We have two basic ways to achieve scalability, namely increasing system capacity, typically through replication, and performance optimization of system components.

Like any software architecture quality attribute, scalability cannot be achieved in isolation. It inevitably involves complex trade-offs that need to be tuned to an application’s requirements. I’ll be discussing these fundamental trade-offs throughout the remainder of this book, starting in the next chapter when I describe concrete architecture approaches to achieve scalability.

1 Neil Ernst et al., Technical Debt in Practice: How to Find It and Fix It (MIT Press, 2021).

2 Ian Gorton et al., “Data-Intensive Computing in the 21st Century,” Computer 41, no. 4 (April 2008): 30–32.

3 Rachel Potvin and Josh Levenberg, “Why Google Stores Billions of Lines of Code in a Single Repository,” Communications of the ACM 59, no. 7 (July 2016): 78–87.

4 The report is not for the squeamish. Here’s one illustrative PG-13 data point—the site had 42 billion visits in 2019! Some of the statistics will definitely make your eyes bulge.

Chapter 2. Distributed Systems Architectures: An Introduction

In this chapter, I’ll broadly cover some of the fundamental approaches to scaling a software system. You can regard this as a 30,000-foot view of the content that is covered in Part II, Part III, and Part IV of this book. I’ll take you on a tour of the main architectural approaches used for scaling a system, and give pointers to later chapters where these issues are dealt with in depth. You can think of this as an overview of why we need these architectural tactics, with the remainder of the book explaining the how.

The type of systems this book is oriented toward are the internet-facing systems we all utilize every day. I’ll let you name your favorite. These systems accept requests from users through web and mobile interfaces, store and retrieve data based on user requests or events (e.g., a GPS-based system), and have some intelligent features such as providing recommendations or notifications based on previous user interactions.

I’ll start with a simple system design and show how it can be scaled. In the process, I’ll introduce several concepts that will be covered in much more detail later in this book. This chapter just gives a broad overview of these concepts and how they aid in scalability—truly a whirlwind tour!

Basic System Architecture

Virtually all massive-scale systems start off small and grow due to their success. It’s common, and sensible, to start with a development framework such as Ruby on Rails, Django, or equivalent, which promotes rapid development to get a system quickly up and running. A typical very simple software architecture for “starter” systems, which closely resembles what you get with rapid development frameworks, is shown in Figure 2-1. This comprises a client tier, application service tier, and a database tier. If you use Rails or equivalent, you also get a framework which hardwires a model–view–controller (MVC) pattern for web application processing and an object–relational mapper (ORM) that generates SQL queries.

Figure 2-1. Basic multitier distributed systems architecture

With this architecture, users submit requests to the application from their mobile app or web browser. The magic of internet networking (see Chapter 3) delivers these requests to the application service, which is running on a machine hosted in some corporate or commercial cloud data center. Communication uses a standard application-level network protocol, typically HTTP.

The application service runs code supporting an API that clients use to send HTTP requests. Upon receipt of a request, the service executes the code associated with the requested API. In the process, it may read from or write to a database or some other external system, depending on the semantics of the API. When the request is complete, the service sends the results to the client to display in their app or browser.
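
To make the request–response flow concrete, here is a minimal sketch of such a service using the JDK’s built-in com.sun.net.httpserver; the /balance API and the in-memory “database” are stand-ins for a real handler and database query.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class MinimalService {
    // Stand-in for a real database query.
    private static final Map<String, String> BALANCES = Map.of("alice", "251.20");

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // The handler runs for each incoming request to the /balance API.
        server.createContext("/balance", exchange -> {
            // e.g., GET /balance?user=alice — query parsing kept trivial here.
            String query = exchange.getRequestURI().getQuery();
            String user = (query != null && query.startsWith("user="))
                    ? query.substring(5) : "";
            String body = BALANCES.getOrDefault(user, "unknown user");
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(bytes);
            }
        });
        server.start();
        System.out.println("Listening on http://localhost:8080/balance?user=alice");
    }
}
```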

Many, if not most, systems conceptually look exactly like this. The application service code exploits a server execution environment that enables multiple requests from multiple users to be processed simultaneously. There’s a myriad of these application server technologies—for example, Java EE and the Spring Framework for Java, or Flask for Python—that are widely used in this scenario.

This approach leads to what is generally known as a monolithic architecture.1 Monoliths tend to grow in complexity as the application becomes more feature-rich. All API handlers are built into the same server code body. This eventually makes it hard to modify and test rapidly, and the execution footprint can become extremely heavyweight as all the API implementations run in the same application service.

Still, if request loads stay relatively low, this application architecture can suffice. The service has the capacity to process requests with consistently low latency. But if request loads keep growing, this means latencies will increase as the service has insufficient CPU/memory capacity for the concurrent request volume and therefore requests will take longer to process. In these circumstances, our single server is overloaded and has become a bottleneck.

In this case, the first strategy for scaling is usually to “scale up” the application service hardware. For example, if your application is running on AWS, you might upgrade your server from a modest t3.xlarge instance with four (virtual) CPUs and 16 GB of memory to a t3.2xlarge instance, which doubles the number of CPUs and memory available for the application.2

Scaling up is simple. It gets many real-world applications a long way to supporting larger workloads. It obviously costs more money for hardware, but that’s scaling for you.

It’s inevitable, however, that for many applications the load will grow to a level which will swamp a single server node, no matter how many CPUs and how much memory you have. That’s when you need a new strategy—namely, scaling out, or horizontal scaling, which I touched on in Chapter 1.

Scale Out

Scaling out relies on the ability to replicate a service in the architecture and run multiple copies on multiple server nodes. Requests from clients are distributed across the replicas so that in theory, if we have N replicas and R requests, each server node processes R/N requests. This simple strategy increases an application’s capacity and hence scalability.
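
The simplest policy for distributing requests this way is round-robin. A minimal sketch of the idea, with hypothetical replica addresses:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin target selection: successive requests cycle through the
// replicas, so each of N replicas sees roughly R/N of R requests.
public class RoundRobin {
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobin(List<String> replicas) { this.replicas = replicas; }

    String chooseTarget() {
        // floorMod keeps the index valid even if the counter wraps around.
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i);
    }

    public static void main(String[] args) {
        RoundRobin lb = new RoundRobin(List.of("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        for (int r = 0; r < 6; r++) {
            System.out.println("request " + r + " -> " + lb.chooseTarget());
        }
    }
}
```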

To successfully scale out an application, you need two fundamental elements in your design. As illustrated in Figure 2-2, these are:

A load balancer
All user requests are sent to a load balancer, which chooses a service replica target to process the request. Various strategies exist for choosing a target service, all with the core aim of keeping each resource equally busy. The load balancer also relays the responses from the service back to the client. Most load balancers belong to a class of internet components known as reverse proxies. These control access to server resources for client requests. As an intermediary, reverse proxies add an extra network hop for a request; they need to be extremely low latency to minimize the overheads they introduce. There are many off-the-shelf load balancing solutions as well as cloud provider–specific ones, and I’ll cover the general characteristics of these in much more detail in Chapter 5.
Stateless services
For load balancing to be effective and share requests evenly, the load balancer must be free to send consecutive requests from the same client to different service instances for processing. This means the API implementations in the services must retain no knowledge, or state, associated with an individual client’s session. When a user accesses an application, a user session is created by the service and a unique session is managed internally to identify the sequence of user interactions and track session state. A classic example of session state is a shopping cart. To use a load balancer effectively, the data representing the current contents of a user’s cart must be stored somewhere—typically a data store—such that any service replica can access this state when it receives a request as part of a user session. In Figure 2-2, this is labeled as a “Session store.”

Scaling out is attractive as, in theory, you can keep adding new (virtual) hardware and services to handle increased request loads and keep request latencies consistent and low. As soon as you see latencies rising, you deploy another server instance. This requires no code changes with stateless services and is relatively cheap as a result—you just pay for the hardware you deploy.

Scaling out has another highly attractive feature. If one of the services fails, the requests it is processing will be lost. But as the failed service manages no session state, these requests can be simply reissued by the client and sent to another service instance for processing. This means the application is resilient to failures in the service software and hardware, thus enhancing the application’s availability.

Unfortunately, as with any engineering solution, simple scaling out has limits. As you add new service instances, the request processing capacity grows, potentially infinitely. At some stage, however, reality will bite and the capability of your single database to provide low-latency query responses will diminish. Slow queries will mean longer response times for clients. If requests keep arriving faster than they are being processed, some system components will become overloaded and fail due to resource exhaustion, and clients will see exceptions and request timeouts. Essentially, your database becomes a bottleneck that you must engineer away in order to scale your application further.

Figure 2-2. Scale-out architecture

Scaling the Database with Caching

Scaling up by increasing the number of CPUs, memory, and disks in a database server can go a long way to scaling a system. For example, at the time of writing, GCP can provision a SQL database on a db-n1-highmem-96 node, which has 96 virtual CPUs (vCPUs), 624 GB of memory, 30 TB of disk, and can support 4,000 connections. This will cost somewhere between $6K and $16K per year, which sounds like a good deal to me! Scaling up is a common database scalability strategy.

Large databases need constant care and attention from highly skilled database administrators to keep them tuned and running fast. There’s a lot of wizardry in this job—e.g., query tuning, disk partitioning, indexing, on-node caching, and so on—and hence database administrators are valuable people you want to be very nice to. They can make your application services highly responsive.

In conjunction with scaling up, a highly effective approach is querying the database as infrequently as possible from your services. This can be achieved by employing distributed caching in the scaled-out service tier. Caching stores recently retrieved and commonly accessed database results in memory so they can be quickly retrieved without placing a burden on the database. For example, the weather forecast for the next hour won’t change, but may be queried by hundreds or thousands of clients. You can use a cache to store the forecast once it is issued. All client requests will read from the cache until the forecast expires.

For data that is frequently read and changes rarely, your processing logic can be modified to first check a distributed cache, such as a Redis or memcached store. These cache technologies are essentially distributed key-value stores with very simple APIs. This scheme is illustrated in Figure 2-3. Note that the session store from Figure 2-2 has disappeared. This is because you can use a general-purpose distributed cache to store session identifiers along with application data.

Figure 2-3. Introducing distributed caching

Accessing the cache requires a remote call from your service. If the data you need is in the cache, on a fast network you can expect submillisecond cache reads. This is far less expensive than querying the shared database instance, and also doesn’t require a query to contend for typically scarce database connections.

Introducing a caching layer also requires your processing logic to be modified to check for cached data. If what you want is not in the cache, your code must still query the database and load the results into the cache as well as return it to the caller. You also need to decide when to remove, or invalidate, cached results—your course of action depends on the nature of your data (e.g., weather forecasts expire naturally) and your application’s tolerance to serving out-of-date—also known as stale—results to clients.
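
Here’s a minimal sketch of that read logic, assuming the popular Jedis client for Redis; the key scheme, the one-hour expiry, and the queryDatabase stub are illustrative.

```java
import redis.clients.jedis.Jedis;

public class ForecastCache {
    private final Jedis jedis = new Jedis("localhost", 6379);

    String getForecast(String location) {
        String key = "forecast:" + location;   // hypothetical key scheme
        String cached = jedis.get(key);
        if (cached != null) {
            return cached;                     // cache hit: no database access
        }
        // Cache miss: query the database, then populate the cache with a
        // one-hour expiry so stale forecasts are invalidated automatically.
        String forecast = queryDatabase(location);
        jedis.setex(key, 3600, forecast);
        return forecast;
    }

    private String queryDatabase(String location) {
        return "Sunny in " + location;         // stand-in for a real query
    }
}
```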

A well-designed caching scheme can be invaluable in scaling a system. Caching works great for data that rarely changes and is accessed frequently, such as inventory catalogs, event information, and contact data. If you can handle a large percentage, say, 80% or more, of read requests from your cache, then you effectively buy extra capacity at your databases as they never see a large proportion of requests.

Still, many systems need rapid access to data stores of terabytes and larger, which makes a single database effectively prohibitive. In these systems, a distributed database is needed.

Distributing the Database

There are more distributed database technologies around in 2022 than you probably want to imagine. It’s a complex area, and one I’ll cover extensively in Part III. In very general terms, there are two major categories:

Distributed SQL stores
These enable organizations to scale out their SQL database relatively seamlessly by storing the data across multiple disks that are queried by multiple database engine replicas. These multiple engines logically appear to the application as a single database, hence minimizing code changes. There is also a class of “born distributed” SQL databases that are commonly known as NewSQL stores that fit in this category.
Distributed so-called “NoSQL” stores (from a whole array of vendors)
These products use a variety of data models and query languages to distribute data across multiple nodes running the database engine, each with their own locally attached storage. Again, the location of the data is transparent to the application, and typically controlled by the design of the data model using hashing functions on database keys. Leading products in this category are Cassandra, MongoDB, and Neo4j.
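
A minimal sketch of the key-hashing idea that decides where a data item lives (production stores typically use consistent hashing instead, so that adding or removing nodes moves only a fraction of the keys):

```java
// Maps a database key to one of N storage nodes. With a plain modulo
// scheme like this, changing N forces most keys to move, which is why
// production databases favor consistent hashing.
public class KeyPartitioner {
    static int nodeFor(String key, int numNodes) {
        return Math.floorMod(key.hashCode(), numNodes);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"skier-1001", "skier-1002", "skier-1003"}) {
            System.out.println(key + " -> node " + nodeFor(key, 4));
        }
    }
}
```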

Figure 2-4 shows how our architecture incorporates a distributed database. As the data volumes grow, a distributed database can increase the number of storage nodes. As nodes are added (or removed), the data managed across all nodes is rebalanced to attempt to ensure the processing and storage capacity of each node is equally utilized.

Figure 2-4. Scaling the data tier using a distributed database

Distributed databases also promote availability. They support replicating each data storage node so if one fails or cannot be accessed due to network problems, another copy of the data is available. The models utilized for replication and the trade-offs these require (spoiler alert: consistency) are covered in later chapters.

If you are utilizing a major cloud provider, there are also two deployment choices for your data tier. You can deploy your own virtual resources and build, configure, and administer your own distributed database servers. Alternatively, you can utilize cloud-hosted databases. The latter simplifies the administrative effort associated with managing, monitoring, and scaling the database, as many of these tasks essentially become the responsibility of the cloud provider you choose. As usual, the no free lunch principle applies. You always pay, whichever approach you choose.

Multiple Processing Tiers

Any realistic system that you need to scale will have many different services that interact to process a request. For example, accessing a web page on Amazon.com can require in excess of 100 different services being called before a response is returned to the user.3

The beauty of the stateless, load-balanced, cached architecture I am elaborating in this chapter is that it’s possible to extend the core design principles and build a multitiered application. In fulfilling a request, a service can call one or more dependent services, which in turn are replicated and load-balanced. A simple example is shown in Figure 2-5. There are many nuances in how the services interact, and how applications ensure rapid responses from dependent services. Again, I’ll cover these in detail in later chapters.

Figure 2-5. Scaling processing capacity with multiple tiers

This design also promotes having different, load-balanced services at each tier in the architecture. For example, Figure 2-6 illustrates two replicated internet-facing services that both utilize a core service that provides database access. Each service is load balanced and employs caching to provide high performance and availability. This design is often used to provide one service for web clients and one for mobile clients, each of which can be scaled independently based on the load they experience. It’s commonly called the Backend for Frontend (BFF) pattern.4

Figure 2-6. Scalable architecture with multiple services

In addition, by breaking the application into multiple independent services, you can scale each based on the service demand. If, for example, you see an increasing volume of requests from mobile users and decreasing volumes from web users, it’s possible to provision different numbers of instances for each service to satisfy demand. This is a major advantage of refactoring monolithic applications into multiple independent services, which can be separately built, tested, deployed, and scaled. I’ll explore some of the major issues in designing systems based on such services, known as microservices, in Chapter 9.

Increasing Responsiveness

Most client application requests expect a response. A user might want to see all auction items for a given product category or see the real estate that is available for sale in a given location. In these examples, the client sends a request and waits until a response is received. This time interval between sending the request and receiving the result is the response time of the request. You can decrease response times by using caching and precalculated responses, but many requests will still result in database accesses.

A similar scenario exists for requests that update data in an application. If a user updates their delivery address immediately prior to placing an order, the new delivery address must be persisted so that the user can confirm the address before they hit the “purchase” button. The response time in this case includes the time for the database write, which is confirmed by the response the user receives.

Some update requests, however, can be successfully responded to without fully persisting the data in a database. For example, the skiers and snowboarders among you will be familiar with lift ticket scanning systems that check you have a valid pass to ride the lifts that day. They also record which lifts you take, the time you get on, and so on. Nerdy skiers/snowboarders can then use the resort’s mobile app to see how many lifts they ride in a day.

As a person waits to get on a lift, a scanner device validates the pass using an RFID (radio-frequency identification) chip reader. The information about the rider, lift, and time are then sent over the internet to a data capture service operated by the ski resort. The lift rider doesn’t have to wait for this to occur, as the response time could slow down the lift-loading process. There’s also no expectation from the lift rider that they can instantly use their app to ensure this data has been captured. They just get on the lift, talk smack with their friends, and plan their next run.

Service implementations can exploit this type of scenario to improve responsiveness. The data about the event is sent to the service, which acknowledges receipt and concurrently stores the data in a remote queue for subsequent writing to the database. Distributed queueing platforms can be used to reliably send data from one service to another, typically but not always in a first-in, first-out (FIFO) manner.

Writing a message to a queue is typically much faster than writing to a database, and this enables the request to be successfully acknowledged much more quickly. Another service is deployed to read messages from the queue and write the data to the database. When a skier checks their lift rides—maybe three hours or three days later—the data has been persisted successfully in the database.

The basic architecture to implement this approach is illustrated in Figure 2-7.

Figure 2-7. Increasing responsiveness with queueing

Whenever the results of a write operation are not immediately needed, an application can use this approach to improve responsiveness and, as a result, scalability. Many queueing technologies exist that applications can utilize, and I’ll discuss how these operate in Chapter 7. These queueing platforms all provide asynchronous communications. A producer service writes to the queue, which acts as temporary storage, while another consumer service removes messages from the queue and makes the necessary updates to, in our example, a database that stores skier lift ride details.
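
Here’s a sketch of the pattern using an in-process java.util.concurrent.BlockingQueue as a stand-in for the distributed queueing platforms discussed in Chapter 7; the LiftRide record and persistToDatabase stub are illustrative.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LiftRidePipeline {
    record LiftRide(String skierId, int liftId, long timestamp) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<LiftRide> queue = new LinkedBlockingQueue<>();

        // Consumer: drains the queue and persists each event to the database.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    LiftRide ride = queue.take();  // blocks until a message arrives
                    persistToDatabase(ride);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Producer: the data capture service acknowledges the scanner
        // immediately after enqueueing—no waiting on the database write.
        queue.put(new LiftRide("skier-1001", 7, System.currentTimeMillis()));
        System.out.println("Acknowledged lift ride; write completes eventually");
        Thread.sleep(100);  // give the consumer a moment in this demo
    }

    static void persistToDatabase(LiftRide ride) {
        System.out.println("Persisted: " + ride);  // stand-in for a real write
    }
}
```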

The key is that the data eventually gets persisted. Eventually typically means a few seconds at most, but use cases that employ this design should be resilient to longer delays without impacting the user experience.

Systems and Hardware Scalability

Even the most carefully crafted software architecture and code will be limited in terms of scalability if the services and data stores run on inadequate hardware. The open source and commercial platforms that are commonly deployed in scalable systems are designed to utilize additional hardware resources in terms of CPU cores, memory, and disks. It’s a balancing act between achieving the performance and scalability you require, and keeping your costs as low as possible.

That said, there are some cases where upgrading the number of CPU cores and available memory is not going to buy you more scalability. For example, if code is single threaded, running it on a node with more cores is not going to improve performance. It’ll just use one core at any time. The rest are simply not used. If multithreaded code contains many serialized sections, only one thread can proceed at a time to ensure the results are correct. This phenomenon is described by Amdahl’s law, which gives us a way to calculate the theoretical speedup of code when adding more CPU cores, based on the proportion of the code that executes serially.
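
Stated as a formula, with p the fraction of the code that can execute in parallel and N the number of cores, Amdahl’s law gives the speedup as:

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

Working the numbers: with 5% serial code (p = 0.95) the speedup can never exceed 20, and S(2048) ≈ 19.8 is already at that ceiling; with 50% serial code (p = 0.5) the ceiling is 2, and S(8) ≈ 1.78. This is where the two data points below come from.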

Two data points from Amdahl’s law are:

  • If only 5% of the code executes serially and the rest in parallel, adding more than 2,048 cores has essentially no effect.

  • If 50% of the code executes serially and the rest in parallel, adding more than 8 cores has essentially no effect.

This demonstrates why efficient, multithreaded code is essential to achieving scalability. If your code is not running as highly independent tasks implemented as threads, then not even money will buy you scalability. That’s why I devote Chapter 4 to the topic of multithreading—it’s a core knowledge component for building scalable distributed systems.

To illustrate the effect of upgrading hardware, Figure 2-8 shows how the throughput of a benchmark system improves as the database is deployed on more powerful (and expensive) hardware.5 The benchmark employs a Java service that accepts requests from a load generating client, queries a database, and returns the results to the client. The client, service, and database run on different hardware resources deployed in the same regions in the AWS cloud.

Figure 2-8. An example of scaling up a database server

In the tests, the number of concurrent requests grows from 32 to 256 (x-axis) and each line represents the system throughput (y-axis) for a different hardware configuration on Amazon’s Relational Database Service (RDS). The different configurations are listed at the bottom of the chart, with the least powerful on the left and the most powerful on the right. Each client sends a fixed number of requests synchronously over HTTP, with no pause between receiving the results of one request and sending the next. This consequently exerts a high request load on the server.

From this chart, it’s possible to make some straightforward observations:

  • In general, the more powerful the hardware selected for the database, the higher the throughput. That is good.

  • The difference between the db.t2.xlarge and db.t2.2xlarge instances in terms of throughput is minimal. This could be because the service tier is becoming a bottleneck, or our database model and queries are not exploiting the additional resources of the db.t2.2xlarge RDS instance. Regardless—more bucks, no bang.

  • The two least powerful instances perform pretty well until the request load is increased to 256 concurrent clients. The dip in throughput for these two instances indicates they are overloaded and things will only get worse if the request load increases.

Hopefully, this simple example illustrates why scaling through simple upgrading of hardware needs to be approached carefully. Adding more hardware always increases costs, but may not always give the performance improvement you expect. Running simple experiments and taking measurements is essential for assessing the effects of hardware upgrades. It gives you solid data to guide your design and justify costs to stakeholders.

Summary and Further Reading

In this chapter I’ve provided a whirlwind tour of the major approaches you can utilize to scale out a system as a collection of communicating services and distributed databases. Much detail has been brushed over, and as you have no doubt realized—in software systems the devil is in the detail. Subsequent chapters will therefore progressively start to explore these details, starting with some fundamental characteristics of distributed systems in Chapter 3 that everyone should be aware of.

Another area this chapter has skirted around is the subject of software architecture. I’ve used the term services for distributed components in an architecture that implement application business logic and database access. These services are independently deployed processes that communicate using remote communications mechanisms such as HTTP. In architectural terms, these services are most closely mirrored by those in the service-oriented architecture (SOA) pattern, an established architectural approach for building distributed systems. A more modern evolution of this approach revolves around microservices. These tend to be more cohesive, encapsulated services that promote continuous development and deployment.

If you’d like a much more in-depth discussion of these issues, and software architecture concepts in general, then Mark Richards and Neal Ford’s book Fundamentals of Software Architecture: An Engineering Approach (O’Reilly, 2020) is an excellent place to start.

Finally, there’s a class of big data software architectures that address some of the issues that come to the fore with very large data collections. One of the most prominent is data reprocessing. This occurs when data that has already been stored and analyzed needs to be reanalyzed due to code or business rule changes. This reprocessing may occur due to software fixes, or the introduction of new algorithms that can derive more insights from the original raw data. There’s a good discussion of the Lambda and Kappa architectures, both of which are prominent in this space, in Jay Kreps’s 2014 article “Questioning the Lambda Architecture” on the O’Reilly Radar blog.

1 Mark Richards and Neal Ford, Fundamentals of Software Architecture: An Engineering Approach (O’Reilly Media, 2020).

2 See Amazon EC2 Instance Types for a description of AWS instances.

3 Werner Vogels, “Modern Applications at AWS,” All Things Distributed, 28 Aug. 2019, https://oreil.ly/FXOep.

4 Sam Newman, “Pattern: Backends For Frontends,” Sam Newman & Associates, November 18, 2015, https://oreil.ly/1KR1z.

5 Results are courtesy of Ruijie Xiao from Northeastern University, Seattle.

Chapter 3. Distributed Systems Essentials

As I described in Chapter 2, scaling a system naturally involves adding multiple independently moving parts. We run our software components on multiple machines and our databases across multiple storage nodes, all in the quest of adding more processing capacity. Consequently, our solutions are distributed across multiple machines in multiple locations, with each machine processing events concurrently, and exchanging messages over a network.

This fundamental nature of distributed systems has some profound implications on the way we design, build, and operate our solutions. This chapter provides the basic information you need to know to appreciate the issues and complexities of distributed software systems. I’ll briefly cover communications networks hardware and software, remote method invocation, how to deal with the implications of communications failures, distributed coordination, and the thorny issue of time in distributed systems.

Communications Basics

Every distributed system has software components that communicate over a network. If a mobile banking app requests the user’s current bank account balance, a (very simplified) sequence of communications occurs along the lines of:

  1. The mobile banking app sends a request over the cellular network addressed to the bank to retrieve the user’s bank balance.

  2. The request is routed across the internet to where the bank’s web servers are located.

  3. The bank’s web server authenticates the request (confirms that it originated from the supposed user) and sends a request to a database server for the account balance.

  4. The database server reads the account balance from disk and returns it to the web server.

  5. The web server sends the balance in a reply message addressed to the app, which is routed over the internet and the cellular network until the balance magically appears on the screen of the mobile device.

It almost sounds simple when you read the above, but in reality, there’s a huge amount of complexity hidden beneath this sequence of communications. Let’s examine some of these complexities in the following sections.

Communications Hardware

The bank balance request example above will inevitably traverse multiple different networking technologies and devices. The global internet is a heterogeneous machine, comprising different types of network communications channels and devices that shuttle many millions of messages per second across networks to their intended destinations.

Different types of communications channels exist. The most obvious categorization is wired versus wireless. For each category there are multiple network transmission hardware technologies that can ship bits from one machine to another. Each technology has different characteristics, and the ones we typically care about are speed and range.

For physically wired networks, the two most common types are local area networks (LANs) and wide area networks (WANs). LANs are networks that can connect devices at “building scale,” being able to transmit data over a small number (e.g., 1–2) of kilometers. Contemporary LANs in data centers can transport between 10 and 100 gigabits per second (Gbps). This is known as the network’s bandwidth, or capacity. The time taken to transmit a message across a LAN—the network’s latency—is submillisecond with modern LAN technologies.

WANs are networks that traverse the globe and make up what we collectively call the internet. These long-distance connections are the high speed data pipelines connecting cities, countries, and continents with fiber optic cables. These cables support a networking technology known as wavelength division multiplexing which makes it possible to transmit up to 171 Gbps over 400 different channels, giving more than 70 terabits per second (Tbps) of total bandwidth for a single fiber link. The fiber cables that span the world normally comprise four or more strands of fiber, giving bandwidth capacity of hundreds of Tbps for each cable.

Latency is more complicated with WANs, however. WANs transmit data over hundreds and thousands of kilometers, and the maximum speed that the data can travel in fiber optic cables is the theoretical speed of light. In reality, these cables can’t reach the speed of light, but do get pretty close to it, as shown in Table 3-1.

Table 3-1. WAN speeds

Path                        Distance    Travel time (speed of light)   Travel time (fiber optic cable)
New York to San Francisco   4,148 km    14 ms                          21 ms
New York to London          5,585 km    19 ms                          28 ms
New York to Sydney          15,993 km   53 ms                          80 ms

Actual times will be slower than the fiber optic travel times in Table 3-1 as the data needs to pass through networking equipment known as routers. The global internet has a complex hub-and-spoke topology with many potential paths between nodes in the network. Routers are therefore responsible for transmitting data on the physical network connections to ensure data is transmitted across the internet from source to destination.

Routers are specialized, high-speed devices that can handle several hundred Gbps of network traffic, pulling data off incoming connections and sending the data out to different outgoing network connections based on their destination. Routers at the core of the internet comprise racks of these devices and can process tens to hundreds of Tbps. This is how you and thousands of your friends get to watch a steady video stream on Netflix at the same time.

Wireless technologies have different range and bandwidth characteristics. WiFi routers that we are all familiar with in our homes and offices are wireless Ethernet networks and use 802.11 protocols to send and receive data. The most widely used WiFi protocol, 802.11ac, allows for maximum (theoretical) data rates of up to 5,400 Mbps. The most recent 802.11ax protocol, also known as WiFi 6, is an evolution of 802.11ac technology that promises increased throughput speeds of up to 9.6 Gbps. The range of WiFi routers is of the order of tens of meters and of course is affected by physical impediments like walls and floors.

Cellular wireless technology uses radio waves to send data from our phones to routers mounted on cell towers, which are generally connected by wires to the core internet for message routing. Each cellular technology introduces improved bandwidth and other dimensions of performance. The most common technology at the time of writing is 4G LTE wireless broadband. 4G LTE is around 10 times faster than the older 3G, able to handle sustained download speeds around 10 Mbps (peak download speeds are nearer 50 Mbps) and upload speeds between 2 and 5 Mbps.

Emerging 5G cellular networks promise 10x bandwidth improvements over existing 4G, with 1–2 millisecond latencies between devices and cell towers. This is a great improvement over 4G latencies, which are in the 20–40 millisecond range. The trade-off is range. 5G base station range operates at about 500 meters maximum, whereas 4G provides reliable reception at distances of 10–15 km.

This whole collection of different hardware types for networking comes together in the global internet. The internet is a heterogeneous network, with many different operators around the world and every type of hardware imaginable. Figure 3-1 shows a simplified view of the major components that comprise the internet. Tier 1 networks are the global high-speed internet backbone. There are around 20 Tier 1 internet service providers (ISPs) who manage and control global traffic. Tier 2 ISPs are typically regional (e.g., one country), have lower bandwidth than Tier 1 ISPs, and deliver content to customers through Tier 3 ISPs. Tier 3 ISPs are the ones that charge you exorbitant fees for your home internet every month.

Figure 3-1. Simplified view of the internet

There’s a lot more complexity to how the internet works than described here. That level of networking and protocol complexity is beyond the scope of this chapter. From a distributed systems software perspective, we need to understand more about the “magic” that enables all this hardware to route messages from, say, my cell phone to my bank and back. This is where the Internet Protocol (IP) comes in.

Communications Software

Software systems on the internet communicate using the Internet Protocol (IP) suite. The IP suite specifies host addressing, data transmission formats, message routing, and delivery characteristics. There are four abstract layers, which contain related protocols that support the functionality required at that layer. These are, from lowest to highest:

  1. The data link layer, specifying communication methods for data across a single network segment. This is implemented by the device drivers and network cards that live inside your devices.

  2. The internet layer specifies addressing and routing protocols that make it possible for traffic to traverse the independently managed and controlled networks that comprise the internet. This is the IP layer in the internet protocol suite.

  3. The transport layer, specifying protocols for reliable and best-effort, host-to-host communications. This is where the well-known Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) live.

  4. The application layer, which comprises several application-level protocols such as HTTP and the secure copy protocol (SCP).

Each of the higher-layer protocols builds on the features of the lower layers. In the following section, I’ll briefly cover IP for host discovery and message routing, and TCP and UDP that can be utilized by distributed applications.

Internet Protocol (IP)

IP defines how hosts are assigned addresses on the internet and how messages are transmitted between two hosts who know each other’s addresses.

Every device on the internet has its own address. These are known as Internet Protocol (IP) addresses. The location of an IP address can be found using an internet-wide directory service known as Domain Name System (DNS). DNS is a widely distributed, hierarchical database that acts as the address book of the internet.

The technology currently used to assign IP addresses, known as Internet Protocol version 4 (IPv4), will eventually be replaced by its successor, IPv6. IPv4 is a 32-bit addressing scheme that before long will run out of addresses due to the number of devices connecting to the internet. IPv6 is a 128-bit scheme that will offer an (almost) infinite number of IP addresses. As an indicator, in July 2020 about 33% of the traffic processed by Google.com was IPv6.

DNS servers are organized hierarchically. A small number of root DNS servers, which are highly replicated, are the starting point for resolving an IP address. When an internet browser tries to find a website, a network host known as the local DNS server (managed by your employer or ISP) will contact a root DNS server with the requested hostname. The root server replies with a referral to a so-called authoritative DNS server that manages name resolution for, in our banking example, .com addresses. There is an authoritative name server for each top-level internet domain (.com, .org, .net, etc.).

Next, the local DNS server will query the .com DNS server, which will reply with the address of the DNS server that knows about all the IP addresses managed by igbank.com. This DNS is queried, and it returns the actual IP address we need to communicate with the application. The overall scheme is illustrated in Figure 3-2.

Figure 3-2. Example DNS lookup for igbank.com

The whole DNS database is highly geographically replicated so there are no single points of failure, and requests are spread across multiple physical servers. Local DNS servers also remember the IP addresses of recently contacted hosts, which is possible as IP addresses don’t change very often. This means the complete name resolution process doesn’t occur for every site we contact.
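
To see what name resolution looks like from application code, here is a minimal sketch using the JDK’s InetAddress class. The igbank.com hostname is, as throughout this chapter, a hypothetical example; the entire hierarchical lookup described above hides behind a single library call:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsLookup {
    public static void main(String[] args) {
        try {
            // The local DNS resolver performs the hierarchical lookup
            // (root, .com, igbank.com) on our behalf
            InetAddress address = InetAddress.getByName("igbank.com");
            System.out.println("Resolved to: " + address.getHostAddress());
        } catch (UnknownHostException e) {
            // Name resolution failed: no such host, or DNS unreachable
            System.err.println("Lookup failed: " + e.getMessage());
        }
    }
}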

Armed with a destination IP address, a host can start sending data across the network as a series of IP data packets. IP delivers data from the source to the destination host based on the IP addresses in the packet headers. IP defines a packet structure that contains the data to be delivered, along with header data including source and destination IP addresses. Data sent by an application is broken up into a series of packets which are independently transmitted across the internet.

IP is known as a best-effort delivery protocol. This means it does not attempt to compensate for the various error conditions that can occur during packet transmission. Possible transmission errors include data corruption, packet loss, and duplication. In addition, every packet is routed across the internet from source to destination independently. Treating every packet independently is known as packet switching. This allows the network to dynamically respond to conditions such as network link failure and congestion, and hence is a defining characteristic of the internet. This does mean, however, that different packets may be delivered to the same destination via different network paths, resulting in out-of-order delivery to the receiver.

Because of this design, IP is unreliable. If two hosts require reliable data transmission, they need to add additional features to make this occur. This is where the next layer in the IP protocol suite, the transport layer, enters the scene.

Transmission Control Protocol (TCP)

Once an application or browser has discovered the IP address of the server it wishes to communicate with, it can send messages using a transport protocol API. This is achieved using TCP or UDP, which are the popular standard transport protocols for the IP network stack.

Distributed applications can choose which of these protocols to use. Implementations are widely available in mainstream programming languages such as Java, Python, and C++. In reality, use of these APIs is not common as higher-level programming abstractions hide the details from most applications. In fact, the IP protocol suite application layer contains several of these application-level APIs, including HTTP, which is very widely used in mainstream distributed systems.

Still, it’s important to understand TCP, UDP, and their differences. Most requests on the internet are sent using TCP. TCP is:

  • Connection-oriented

  • Stream-oriented

  • Reliable

I’ll explain each of these qualities, and why they matter, below.

TCP is known as a connection-oriented protocol. Before any messages are exchanged between applications, TCP uses a three-step handshake to establish a two-way connection between the client and server applications. The connection stays open until the TCP client calls close() to terminate the connection with the TCP server. The server responds by acknowledging the close() request before the connection is dropped.

Once a connection is established, a client sends a sequence of requests to the server as a data stream. When a data stream is sent over TCP, it is broken up into individual network packets, with a maximum packet size of 65,535 bytes. Each packet contains a source and destination address, which is used by the underlying IP protocol to route the messages across the network.

The internet is a packet switched network, which means every packet is individually routed across the network. The route each packet traverses can vary dynamically based on the conditions in the network, such as link congestion or failure. This means the packets may not arrive at the server in the same order they are sent from the client. To solve this problem, a TCP sender includes a sequence number in each packet so the receiver can reassemble packets into a stream that is identical to the order they were sent.

Reliability is needed as network packets can be lost or delayed during transmission between sender and receiver. To achieve reliable packet delivery, TCP uses a cumulative acknowledgment mechanism. This means a receiver will periodically send an acknowledgment packet that contains the highest sequence number of the packets received without gaps in the packet stream. This implicitly acknowledges all packets sent with a lower sequence number, meaning all have been successfully received. If a sender doesn’t receive an acknowledgment within a timeout period, the packet is resent.

TCP has many other features, such as checksums to check packet integrity, and dynamic flow control to ensure a sender doesn’t overwhelm a slow receiver by sending data too quickly. Along with connection establishment and acknowledgments, this makes TCP a relatively heavyweight protocol, one that trades efficiency for reliability.

This is where UDP comes into the picture. UDP is a simple, connectionless protocol, which exposes the user’s program to any unreliability of the underlying network. There is no guarantee that delivery will occur in a prescribed order, or that it will happen at all. It can be thought of as a thin veneer (layer) on top of the underlying IP protocol, and deliberately trades reliability for raw performance.

This, however, is highly appropriate for many modern applications where the odd lost packet has very little effect. Think streaming movies, video conferencing, and gaming, where one lost packet is unlikely to be perceptible to a user.

Figure 3-3 depicts some of the major differences between TCP and UDP. TCP incorporates a connection establishment three-packet handshake (SYN, SYN ACK, ACK), and piggybacks acknowledgments (ACK) of packets so that any packet loss can be handled by the protocol. There’s also a TCP connection close phase involving a four-way handshake that is not shown in the diagram. UDP dispenses with connection establishment, tear down, acknowledgments, and retries. Therefore, applications using UDP need to be tolerant of packet loss and client or server failures (and behave accordingly).

Figure 3-3. Comparing TCP and UDP
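
To make UDP’s fire-and-forget nature concrete, here is a minimal sketch using Java’s DatagramSocket. The destination host, port, and payload are assumptions for this example; note the absence of connection setup and acknowledgments: if the packet is lost, the sender never knows.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class UdpSender {
    public static void main(String[] args) throws Exception {
        byte[] payload = "sample-reading-42".getBytes("UTF-8");
        try (DatagramSocket socket = new DatagramSocket()) {
            // No handshake: simply address the packet and send it
            DatagramPacket packet = new DatagramPacket(
                payload, payload.length,
                InetAddress.getByName("localhost"), 9876);
            socket.send(packet);
            // No ACK is expected; delivery is best-effort
        }
    }
}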

Remote Method Invocation

It’s perfectly feasible to write our distributed applications using low-level APIs that interact directly with the transport layer protocols TCP and UDP. The most common approach is the standardized sockets library—see the brief overview in the sidebar. This is something you’ll hopefully never need to do, as sockets are complex and error prone. Essentially, sockets create a bidirectional pipe between two nodes that you can use to send streams of data. There are (luckily) much better ways to build distributed communications, as I’ll describe in this section. These approaches abstract away much of the complexity of using sockets. However, sockets still lurk underneath, so some knowledge is necessary.

In our mobile banking example, the client might request a balance for the user’s checking account using sockets. Ignoring specific language issues (and security!), the client could send a message payload as follows over a connection to the server:

{“balance”, “000169990”}

In this message, “balance” represents the operation we want the server to execute, and “000169990” is the bank account number.

In the server, we need to know that the first string in the message is the operation identifier, and based on this value being “balance”, the second is the bank account number. The server then uses these values to presumably query a database, retrieve the balance, and send back the results, perhaps as a message formatted with the account number and current balance, as below:

{“000169990”, “220.77”}

In any complex system, the server will support many operations. In igbank.com, there might be for example “login”, “transfer”, “address”, “statement”, “transactions”, and so on. Each will be followed by different message payloads that the server needs to interpret correctly to fulfill the client’s request.

What we are defining here is an application-specific protocol. As long as we send the necessary values in the correct order for each operation, the server will be able to respond correctly. If we have an erroneous client that doesn’t adhere to our application protocol, well, our server needs to do thorough error checking. The socket library provides a primitive, low-level method for client/server communications. It provides highly efficient communications, but it is difficult to correctly implement and evolve an application protocol that handles all possibilities. There are better mechanisms.
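
For illustration only, the following sketch shows what the client side of this hand-rolled protocol might look like with the standard socket library. The server host, port, and the single-line, whitespace-delimited message format are assumptions invented for this example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class BalanceClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint for the igbank.com server
        try (Socket socket = new Socket("localhost", 8080);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            // Our implicit application protocol: operation, then account number
            out.println("balance 000169990");
            // Both sides must agree on this format; nothing enforces it
            String reply = in.readLine();
            System.out.println("Server replied: " + reply);
        }
    }
}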

Stepping back, if we were defining the igbank.com server interface in an object-oriented language such as Java, we would have each operation it can process as a method. Each method is passed an appropriate parameter list for that operation, as shown in this example code:

// Simple igbank.com server interface
public interface IGBank {
    public float balance(String accNo);
    public boolean statement(String month);
    // other operations
}

There are several advantages of having such an interface, namely:

  • Calls from the client to the server can be statically checked by the compiler to ensure they are of the correct format and argument types.

  • Changes in the server interface (e.g., adding a new parameter) force changes in the client code to adhere to the new method signature.

  • The interface is clearly defined by the class definition and thus straightforward for a client programmer to understand and utilize.

These benefits of an explicit interface are of course well known in software engineering. The whole discipline of object-oriented design is pretty much based upon these foundations, where an interface defines a contract between the caller and callee. Compared to the implicit application protocol we need to follow with sockets, the advantages are significant.

This fact was recognized reasonably early in the creation of distributed systems. Since the early 1990s, we have seen an evolution of technologies that enable us to define explicit server interfaces and call these across the network using essentially the same syntax as we would in a sequential program. A summary of the major approaches is given in Table 3-2. Collectively, they are known as Remote Procedure Call (RPC), or Remote Method Invocation (RMI) technologies.

Table 3-2. Summary of the major RPC/RMI technologies

Technology | Date | Key features
Distributed Computing Environment (DCE) | Early 1990s | DCE RPC provides a standardized approach for client/server systems. Primary languages are C/C++.
Common Object Request Broker Architecture (CORBA) | Early 1990s | Facilitates language-neutral client/server communications based on an object-oriented interface definition language (IDL). Primary language support for C/C++, Java, Python, and Ada.
Java Remote Method Invocation (RMI) | Late 1990s | Pure Java-based remote method invocation that facilitates distributed client/server systems with the same semantics as Java objects.
XML web services | 2000 | Supports client/server communications based on HTTP and XML. Servers define their remote interface using the Web Services Description Language (WSDL).
gRPC | 2015 | Open source, uses HTTP/2 for transport and Protocol Buffers (Protobuf) as the interface description language.

While the syntax and semantics of these RPC/RMI technologies vary, the essence of how each operates is the same. Let’s continue with our Java example of igbank.com to examine the whole class of approaches. Java offers a Remote Method Invocation (RMI) API for building client/server applications.

Using Java RMI, we can trivially make our IGBank interface example from above into a remote interface, as illustrated in the following code:

import java.rmi.*;
// Simple igbank.com server interface
public interface IGBank extends Remote {
    public float balance(String accNo)
         throws RemoteException;
    public boolean statement(String month)
         throws RemoteException;
    // other operations
}

The java.rmi.Remote interface serves as a marker to inform the Java compiler we are creating an RMI server. In addition, each method must throw java.rmi.RemoteException. These exceptions represent errors that can occur when a distributed call between two objects is invoked over a network. The most common reasons for such an exception would be a communications failure or the server object having crashed.

We then must provide a class that implements this remote interface. The sample code below shows an extract of the server implementation:

public class IGBankServer extends UnicastRemoteObject
                          implements IGBank {
   // constructor/method implementations omitted
   public static void main(String args[]) {
        try {
          IGBankServer server = new IGBankServer();
          // create a registry in local JVM on default port
          Registry registry = LocateRegistry.createRegistry(1099);
          registry.bind("IGBankServer", server);
          System.out.println("server ready");
        } catch (Exception e) {
          // code omitted for brevity
        }
   }
}

Points to note are:

  • The server extends the UnicastRemoteObject class. This essentially provides the functionality to instantiate a remotely callable object.

  • Once the server object is constructed, its availability must be advertised to remote clients. This is achieved by storing a reference to the object in a system service known as the RMI registry, and associating a logical name with it—in this example, “IGBankServer.” The registry is a simple directory service that enables clients to look up the location (network address and object reference) of an RMI server, and obtain a reference to it, simply by supplying the logical name it is associated with in the registry.

An extract from the client code to connect to the server is shown in the following example. It obtains a reference to the remote object by performing a lookup operation in the RMI registry and specifying the logical name that identifies the server. The reference returned by the lookup operation can then be used to call the server object in the same manner as a local object. However, there is a difference—the client must be ready to catch a RemoteException that will be thrown by the Java runtime when the server object cannot be reached:

 // obtain a remote reference to the server
 IGBank bankServer =
        (IGBank) Naming.lookup("rmi://localhost:1099/IGBankServer");
 // now we can call the server
 System.out.println(bankServer.balance("00169990"));
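
As a sketch of the error handling just described, a more defensive version of this client wraps the lookup and the call so that a RemoteException can be caught and acted upon; whether that means retrying, reporting, or failing over is an application decision:

import java.rmi.*;

public class IGBankClient {
    public static void main(String[] args) {
        try {
            IGBank bankServer =
                (IGBank) Naming.lookup("rmi://localhost:1099/IGBankServer");
            System.out.println(bankServer.balance("00169990"));
        } catch (RemoteException e) {
            // Server unreachable or crashed: retry, report, or fail over
            System.err.println("IGBankServer unavailable: " + e.getMessage());
        } catch (Exception e) {
            // e.g., NotBoundException or MalformedURLException from the lookup
            e.printStackTrace();
        }
    }
}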

Figure 3-4 depicts the call sequence among the components that comprise an RMI system. The Stub and Skeleton are objects generated by the compiler from the RMI interface definition, and these facilitate the actual remote communications. The skeleton is in fact a TCP network endpoint (host, port) that listens for calls to the associated server.

Figure 3-4. Schematic depicting the sequence of calls to establish a connection and invoke an RMI server object

The sequence of operations is as follows:

  1. When the server starts, its logical reference is stored in the RMI registry. This entry contains the Java client stub that can be used to make remote calls to the server.

  2. The client queries the registry, and the stub for the server is returned.

  3. The client stub accepts a method call to the server interface from the Java client implementation.

  4. The stub transforms the request into one or more network packets that are sent to the server host. This transformation process is known as marshalling.

  5. The skeleton accepts network requests from the client, and unmarshalls the network packet data into a valid call to the RMI server object implementation. Unmarshalling is the opposite of marshalling—it takes a sequence of network packets and transforms them into a call to an object.

  6. The skeleton waits for the method to return a response.

  7. The skeleton marshalls the method results into a network reply packet that is returned to the client.

  8. The stub unmarshalls the data and passes the result to the Java client call site.

This Java RMI example illustrates the basics that are used for implementing any RPC/RMI mechanism, even in modern languages like Erlang and Go. You are most likely to encounter Java RMI when using the Java Enterprise JavaBeans (EJB) technology. EJBs are a server-side component model built on RMI, which have seen wide usage in the last 20 or so years in enterprise systems.

Regardless of the precise implementation, the basic attraction of RPC/RMI approaches is to provide an abstract calling mechanism that supports location transparency for clients making remote server calls. Location transparency is provided by the registry, or in general any mechanism that enables a client to locate a server through a directory service. This means it is possible for the server to update its network location in the directory without affecting the client implementation.

RPC/RMI is not without its flaws. Marshalling and unmarshalling can become inefficient for complex object parameters. Cross-language marshalling—client in one language, server in another—can cause problems due to types being represented differently in different languages, causing subtle incompatibilities. And if a remote method signature changes, all clients need to obtain a new compatible stub, which can be cumbersome in large deployments.

For these reasons, most modern systems are built around simpler protocols based on HTTP and using JSON for parameter representation. Instead of operation names, HTTP verbs (PUT, GET, POST, etc.) have associated semantics that are mapped to a specific URL. This approach originated in the work by Roy Fielding on the REST approach.1 REST has a set of semantics that comprise a RESTful architecture style, and in reality most systems do not adhere to these. We’ll discuss REST and HTTP API mechanisms in Chapter 5.
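
As a brief preview of that style (Chapter 5 covers it properly), the following sketch issues an HTTP GET using the HttpClient shipped with JDK 11 and later. The URL and resource path are hypothetical:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestBalanceClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // The HTTP verb (GET) plus the resource URL replace the
        // operation name used in RPC/RMI styles
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://igbank.com/accounts/000169990/balance"))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}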

Partial Failures

The components of distributed systems communicate over a network. In communications technology terminology, the shared local and wide area networks that our systems communicate over are known as asynchronous networks.

With asynchronous networks:

  • Nodes can choose to send data to other nodes at any time.

  • The network is half-duplex, meaning that one node sends a request and must wait for a response from the other. These are two separate communications.

  • The time for data to be communicated between nodes is variable, due to reasons like network congestion, dynamic packet routing, and transient network connection failures.

  • The receiving node may not be available due to a software or machine crash.

  • Data can be lost. In wireless networks, packets can be corrupted and hence dropped due to weak signals or interference. Internet routers can drop packets during congestion.

  • Nodes do not have identical internal clocks; hence they are not synchronized.

Note

This is in contrast with synchronous networks, which essentially are full duplex, transmitting data in both directions at the same time with each node having an identical clock for synchronization.

What does this mean for our applications? Well, put simply, when a client sends a request to a server, how long does it wait until it receives a reply? Is the server node just being slow? Is the network congested and the packet has been dropped by a router? If the client doesn’t get a reply, what should it do?

Let’s explore these scenarios in detail. The core problem here, namely whether and when a response is received, is known as handling partial failures, and the general situation is depicted in Figure 3-5.

Figure 3-5. Handling partial failures

When a client wishes to connect to a server and exchange messages, the following outcomes may occur:

  • The request succeeds and a rapid response is received. All is well. (In reality, this outcome occurs for almost every request. Almost is the operative word here.)

  • The destination IP address lookup may fail. In this case, the client rapidly receives an error message and can act accordingly.

  • The IP address is valid but the destination node or target server process has failed. The sender will receive a timeout error message and can inform the user.

  • The request is received by the target server, which fails while processing the request and no response is ever sent.

  • The request is received by the target server, which is heavily loaded. It processes the request but takes a long time (e.g., 34 seconds) to respond.

  • The request is received by the target server and a response is sent. However, the response is not received by the client due to a network failure.

The first three points are easy for the client to handle, as a response is received rapidly. A result from the server or an error message—either allows the client to proceed. Failures that can be detected quickly are easy to deal with.

The rest of the outcomes pose a problem for the client. They do not provide any insight into the reason why a response has not been received. From the client’s perspective, these three outcomes look exactly the same. The client cannot know without waiting (potentially forever) whether the response will arrive eventually or never arrive; waiting forever doesn’t get much work done.

More insidiously, the client cannot know if the operation succeeded and a server or network failure caused the result to never arrive, or if the request is on its way—delayed simply due to congestion in the network/server. These faults are collectively known as crash faults.

The typical solution that clients adopt to handle crash faults is to resend the request after a configured timeout period. However, this is fraught with danger, as Figure 3-6 illustrates. The client sends a request to the server to deposit money in a bank account. When it receives no response after a timeout period, it resends the request. What is the resulting balance? The server may have applied the deposit, or it may not, depending on the partial failure scenario.

Figure 3-6. A client retries a request after a timeout
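
A minimal sketch of such a retrying client appears below. Here sendDeposit is a hypothetical stand-in for the real network call, and the two-second timeout is an arbitrary choice; the crucial point is in the comment, namely that when the timeout fires, the client cannot know whether the deposit was applied:

import java.util.concurrent.*;

public class RetryingClient {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Hypothetical stand-in for the real network call to the server
    boolean sendDeposit(final String account, final double amount) throws Exception {
        // ... open connection, send request, block awaiting the reply ...
        return true;
    }

    boolean depositWithRetry(final String account, final double amount,
                             int maxAttempts) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<Boolean> reply = executor.submit(new Callable<Boolean>() {
                public Boolean call() throws Exception {
                    return sendDeposit(account, amount);
                }
            });
            try {
                return reply.get(2, TimeUnit.SECONDS); // bounded wait for a reply
            } catch (TimeoutException e) {
                // No reply: slow server? lost request? applied deposit?
                // The client cannot distinguish these cases, yet it retries.
                reply.cancel(true);
            }
        }
        throw new Exception("no response after " + maxAttempts + " attempts");
    }
}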

The chance that the deposit may occur twice is a fine outcome for the customer, but the bank is unlikely to be amused by this possibility. Therefore, we need a way to ensure in our server operations implementation that retried, duplicate requests from clients only result in the request being applied once. This is necessary to maintain correct application semantics.

This property is known as idempotence. Idempotent operations can be applied multiple times without changing the result beyond the initial application. This means that for the example in Figure 3-6, the client can retry the request as many times as it likes, and the account will only be increased by $100.

Requests that make no persistent state changes are naturally idempotent. This means all read requests are inherently safe and no extra work is needed on the server. Updates are a different matter. The system needs to devise a mechanism such that duplicate client requests do not cause any state changes and can be detected by the server. In API terms, these endpoints cause mutation of the server state and must therefore be idempotent.

The general approach to building idempotent operations is as follows (a code sketch follows the list):

  • Clients include a unique idempotency key in all requests that mutate state. The key identifies a single operation from the specific client or event source. It is usually a composite of a user identifier, such as the session key, and a unique value such as a local timestamp, UUID, or a sequence number.

  • When the server receives a request, it checks to see if it has previously seen the idempotency key value by reading from a database that is uniquely designed for implementing idempotence. If the key is not in the database, this is a new request. The server therefore performs the business logic to update the application state. It also stores the idempotency key in a database to indicate that the operation has been successfully applied.

  • If the idempotency key is in the database, this indicates that this request is a retry from the client and hence should not be processed. In this case the server returns a valid response for the operation so that (hopefully) the client won’t retry again.
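
The following minimal sketch illustrates this server-side check; a ConcurrentHashMap stands in for the dedicated idempotency-key store, and applyDeposit is a hypothetical business-logic method. Note that, as discussed below, a production version must update the key store and the application state atomically:

import java.util.concurrent.ConcurrentHashMap;

public class IdempotentDepositHandler {
    // Stand-in for a dedicated idempotency-key store
    private final ConcurrentHashMap<String, Boolean> seenKeys =
            new ConcurrentHashMap<String, Boolean>();

    public String handleDeposit(String idempotencyKey,
                                String account, double amount) {
        // putIfAbsent atomically records the key and reports if it was new
        if (seenKeys.putIfAbsent(idempotencyKey, Boolean.TRUE) != null) {
            // Duplicate retry: don't reapply, just return a valid response
            return "OK (already applied)";
        }
        applyDeposit(account, amount); // hypothetical state mutation
        return "OK";
    }

    private void applyDeposit(String account, double amount) {
        // ... update the account balance in the application database ...
    }
}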

The database used to store idempotency keys can be implemented in, for example:

  • A separate database table or collection in the transactional database used for the application data

  • A dedicated database that provides very low latency lookups, such as a simple key-value store

Unlike application data, idempotency keys don’t have to be retained forever. Once a client receives an acknowledgment of a success for an individual operation, the idempotency key can be discarded. The simplest way to achieve this is to automatically remove idempotency keys from the store after a specific time period, such as 60 minutes or 24 hours, depending on application needs and request volumes.

In addition, an idempotent API implementation must ensure that the application state is modified and the idempotency key is stored. Both must occur for success. If the application state is modified and, due to some failure, the idempotency key is not stored, then a retry will cause the operation to be applied twice. If the idempotency key is stored but for some reason the application state is not modified, then the operation has not been applied. If a retry arrives, it will be filtered out as a duplicate because the idempotency key already exists, and the update will be lost.

The implication here is that the updates to the application state and idempotency key store must both occur, or neither must occur. If you know your databases, you’ll recognize this as a requirement for transactional semantics. We’ll discuss how distributed transactions are achieved in Chapter 12. Essentially, transactions ensure exactly-once semantics for operations, which guarantees that all messages will always be processed exactly once—precisely what we need for idempotence.

Exactly once does not mean that there are no message transmission failures, retries, and application crashes. These are all inevitable. The important thing is that the retries eventually succeed and the result is always the same.

We’ll return to the issue of communications delivery guarantees in later chapters. As Figure 3-7 illustrates, there’s a spectrum of semantics, each with different guarantees and performance characteristics. At-most-once delivery is fast and unreliable—this is what the UDP protocol provides. At-least-once delivery is the guarantee provided by TCP/IP, meaning duplicates are inevitable. Exactly-once delivery, as we’ve discussed here, requires guarding against duplicates and hence trades off reliability against slower performance.

Figure 3-7. Communications delivery guarantees

As we’ll see, some advanced communications mechanisms can provide our applications with exactly-once semantics. However, these don’t operate at internet scale because of the performance implications. That is why, as our applications are built on the at-least-once semantics of TCP/IP, we must implement exactly-once semantics in our APIs that cause state mutation.

Consensus in Distributed Systems

Crash faults have another implication for the way we build distributed systems. This is best illustrated by the Two Generals’ Problem, which is depicted in Figure 3-8.

Figure 3-8. The Two Generals’ Problem

Imagine a city under siege by two armies. The armies lie on opposite sides of the city, and the terrain surrounding the city is difficult to travel through and visible to snipers in the city. In order to overwhelm the city, it’s crucial that both armies attack at the same time. This will stretch the city’s defenses and make victory more likely for the attackers. If only one army attacks, then they will likely be repelled.

Given these constraints, how can the two generals reach agreement on the exact time to attack, such that both generals know for certain that agreement has been reached? They both need certainty that the other army will attack at the agreed time, or disaster will ensue.

To coordinate an attack, the first general sends a messenger to the other, with instructions to attack at a specific time. As the messenger may be captured or killed by snipers, the sending general cannot be certain the message has arrived unless they get an acknowledgment messenger from the second general. Of course, the acknowledgment messenger may be captured or killed, so even if the original messenger does get through, the first general may never know. And even if the acknowledgment message arrives, how does the second general know this, unless they get an acknowledgment from the first general?

Hopefully the problem is apparent. With messengers being randomly captured or extinguished, there is no guarantee the two generals will ever reach consensus on the attack time. In fact, it can be proven that it is not possible to guarantee agreement will be reached. There are solutions that increase the likelihood of reaching consensus. For example, Game of Thrones style, each general may send 100 different messengers every time, and even if most are killed, this increases the probability that at least one will make the perilous journey to the other friendly army and successfully deliver the message.

The Two Generals’ Problem is analogous to two nodes in a distributed system wishing to reach agreement on some state, such as the value of a data item that can be updated at either. Partial failures are analogous to losing messages and acknowledgments. Messages may be lost or delayed for an indeterminate period of time—the characteristics of asynchronous networks, as I described earlier in this chapter.

In fact it can be demonstrated that consensus on an asynchronous network in the presence of crash faults, where messages can be delayed but not lost, is impossible to achieve within bounded time. This is known as the FLP Impossibility Theorem.2

Luckily, this is only a theoretical limitation, demonstrating it’s not possible to guarantee consensus will be reached with unbounded message delays on an asynchronous network. In reality, distributed systems reach consensus all the time. This is possible because while our networks are asynchronous, we can establish sensible practical bounds on message delays and retry after a timeout period. FLP is therefore a worst-case scenario, and as such I’ll discuss algorithms for establishing consensus in distributed databases in Chapter 12.

Finally, we should note the issue of Byzantine failures. Imagine extending the Two Generals’ Problem to N generals who need to agree on a time to attack. However, in this scenario, traitorous messengers may change the value of the time of the attack, or a traitorous general may send false information to other generals.

This class of malicious failures is known as Byzantine faults, and these are particularly sinister in distributed systems. Luckily, the systems we discuss in this book typically live behind well-protected, secure enterprise networks and administrative environments. This means we can in practice exclude handling Byzantine faults. Algorithms that do address such malicious behaviors exist, and if you are interested in a practical example, take a look at blockchain consensus mechanisms and Bitcoin.

Time in Distributed Systems

Every node in a distributed system has its own internal clock. If all the clocks on every machine were perfectly synchronized, we could always simply compare the timestamps on events across nodes to determine the precise order they occurred in. If this were reality, many of the problems I’ll discuss with distributed systems would pretty much go away.

Unfortunately, this is not the case. Clocks on individual nodes drift due to environmental conditions like changes in temperature or voltage. The amount of drift varies on every machine, but values such as 10–20 seconds per day are not uncommon. (Or with my current coffee machine at home, about 5 minutes per day!)

If left unchecked, clock drift would render the time on a node meaningless—like the time on my coffee machine if I don’t correct it every few days. To address this problem, a number of time services exist. A time service represents an accurate time source, such as a GPS or atomic clock, which can be used to periodically reset the clock on a node to correct for drift on packet-switched, variable-latency data networks.

The most widely used time service is Network Time Protocol (NTP), which provides a hierarchically organized collection of time servers spanning the globe. The root servers, of which there are around 300 worldwide, are the most accurate. Time servers in the next level of the hierarchy (approximately 20,000) synchronize to within a few milliseconds of the root server periodically, and so on throughout the hierarchy, with a maximum of 15 levels. Globally, there are more than 175,000 NTP servers.

Using the NTP protocol, a node in an application running an NTP client can synchronize to an NTP server. The time on a node is set by a UDP message exchange with one or more NTP servers. Messages are time stamped, and through the message exchange the time taken for message transit is estimated. This becomes a factor in the algorithm used by NTP to establish what the time on the client should be reset to. A simple NTP configuration is shown in Figure 3-9. On a LAN, machines can synchronize to an NTP server to within a few milliseconds.

One interesting effect of NTP synchronization for our applications is that the resetting of the clock can move the local node time forward or backward. This means that if our application is measuring the time taken for events to occur (e.g., to calculate event response times), it is possible that the end time of the event may be earlier than the start time if the NTP protocol has set the local time backward.

Figure 3-9. Illustrating the use of the NTP service

In fact, a compute node has two clocks. These are:

Time of day clock
This represents the number of milliseconds since midnight, January 1st 1970. In Java, you can get the current time using System.currentTimeMillis(). This is the clock that can be reset by NTP, and hence may jump forward or backward if it is a long way behind or ahead of NTP time.
Monotonic clock
This represents the amount of time (in seconds and nanoseconds) since an unspecified point in the past, such as the last time the system was restarted. It will only ever move forward; however, it again may not be a totally accurate measure of elapsed time because it stalls during an event such as virtual machine suspension. In Java, you can get the current monotonic clock time using System.nanoTime().
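
A small sketch contrasts the two clocks. The monotonic clock is the safe choice for measuring elapsed time, because the time-of-day clock may be stepped forward or backward by NTP in the middle of a measurement:

public class ClockDemo {
    public static void main(String[] args) throws InterruptedException {
        // Time-of-day clock: may jump if NTP resets it during the sleep
        long wallStart = System.currentTimeMillis();
        // Monotonic clock: only ever moves forward
        long monoStart = System.nanoTime();

        Thread.sleep(1000); // the event we are timing

        long wallElapsed = System.currentTimeMillis() - wallStart;
        long monoElapsedMs = (System.nanoTime() - monoStart) / 1000000;

        // wallElapsed can even be negative after a backward NTP step;
        // monoElapsedMs cannot
        System.out.println("wall: " + wallElapsed
                + " ms, monotonic: " + monoElapsedMs + " ms");
    }
}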

Applications can use an NTP service to ensure the clocks on every node in the system are closely synchronized. It’s typical for an application to resynchronize clocks on anything from a one hour to one day time interval. This ensures the clocks remain close in value. Still, if an application really needs to precisely know the order of events that occur on different nodes, clock drift is going to make this fraught with danger.

There are other time services that provide higher accuracy than NTP. Chrony supports the NTP protocol but provides much higher accuracy and greater scalability than NTP—the reason it has been adopted by Facebook. Amazon has built the Amazon Time Sync Service by installing GPS and atomic clocks in its data centers. This service is available for free to all AWS customers.

The takeaway from this discussion is that our applications cannot rely on timestamps of events on different nodes to represent the actual order of these events. Clock drift even by a second or two makes cross-node timestamps meaningless to compare. The implications of this will become clear when we start to discuss distributed databases in detail.

Summary and Further Reading

This chapter has covered a lot of ground to explain some of the essential characteristics of communications and time in distributed systems. These characteristics are important for application designers and developers to understand.

The key issues that should resonate from this chapter are as follows:

  1. Communications in distributed systems can transparently traverse many different types of underlying physical networks, including WiFi, wireless, WANs, and LANs. Communication latencies are hence highly variable, and influenced by the physical distance between nodes, physical network properties, and transient network congestion. At large scale, latencies between application components are something that should be minimized as much as possible (within the laws of physics, of course).

  2. The Internet Protocol stack ensures reliable communications across heterogeneous networks through a combination of the IP and TCP protocols. Communications can fail due to network communications fabric and router failures that make nodes unavailable, as well as individual node failure. Your code will experience various TCP/IP overheads, for example, for connection establishment, and errors when network failures occur. Hence, understanding the basics of the IP suite is important for design and debugging.

  3. RMI/RPC technologies build on the TCP/IP layer to provide abstractions for client/server communications that mirror making local method/procedure calls. However, these more abstract programming approaches still need to be resilient to network issues such as failures and retransmissions. This is most apparent in application APIs that mutate state on the server, and must be designed to be idempotent.

  4. Achieving agreement, or consensus, on state across multiple nodes in the presence of crash faults is not possible in bounded time on asynchronous networks. Luckily, real networks, especially LANs, are fast and mostly reliable, meaning we can devise algorithms that achieve consensus in practice. I’ll cover these in Part III of the book when we discuss distributed databases.

  5. There is no reliable global time source that nodes in an application can rely upon to synchronize their behavior. Clocks on individual nodes vary and cannot be used for meaningful comparisons. This means applications cannot meaningfully compare clocks on different nodes to determine the order of events.

These issues will pervade the discussions in the rest of this book. Many of the unique problems and solutions that are adopted in distributed systems stem from these fundamentals. There’s no escaping them!

An excellent source for more detailed, more theoretical coverage of all aspects of distributed systems is George Coulouris et al., Distributed Systems: Concepts and Design, 5th ed. (Pearson, 2001).

Likewise for computer networking, you’ll find out all you wanted to know and no doubt more in James Kurose and Keith Ross’s Computer Networking: A Top-Down Approach, 7th ed. (Pearson, 2017).

1 Roy T. Fielding, “Architectural Styles and the Design of Network-Based Software Architectures.” Dissertation, University of California, Irvine, 2000.

2 Michael J. Fischer et al., “Impossibility of Distributed Consensus with One Faulty Process,” Journal of the ACM 32, no. 2 (1985): 374–82. https://doi.org/10.1145/3149.214121.

Chapter 4. An Overview of Concurrent Systems

Distributed systems comprise multiple independent pieces of code executing in parallel, or concurrently, on many processing nodes across multiple locations. Any distributed system is hence by definition a concurrent system, even if each node is processing events one at a time. The behavior of the various nodes must of course be coordinated in order to make the application behave as desired.

As I described in Chapter 3, coordinating nodes in a distributed system is fraught with danger. Luckily, our industry has matured sufficiently to provide complex, powerful software frameworks that hide many of these distributed system perils from our applications (most of the time, anyway). The majority of this book focuses on describing how we can utilize these frameworks to build scalable distributed systems.

This chapter, however, is concerned with concurrent behavior in our systems on a single node. By explicitly writing our software to perform multiple actions concurrently, we can optimize the processing and resource utilization on a single node, and hence increase our processing capacity both locally and system-wide.

I’ll use the Java 7.0 concurrency capabilities for examples, as these are at a lower level of abstraction than those introduced in Java 8.0. Knowing how concurrent systems operate “closer to the machine” is essential foundational knowledge when building concurrent and distributed systems. Once you understand the lower-level mechanisms for building concurrent systems, the more abstract approaches are easier to optimally exploit. And while this chapter is Java-specific, the fundamental problems of concurrent systems don’t change when you write systems in other languages. Mechanisms for handling concurrency exist in all mainstream programming languages. “Concurrency Models” gives some more details on alternative approaches and how they are implemented in modern languages.

One final point. This chapter is a concurrency primer. It won’t teach you everything you need to know to build complex, high-performance concurrent systems. It will also be useful if your experience writing concurrent programs is rusty, or you have some exposure to concurrent code in another programming language. The further reading section at the end of the chapter points to more comprehensive coverage of this topic for those who wish to delve deeper.

Why Concurrency?

Think of a busy coffee shop. If everyone orders a simple coffee, then the barista can quickly and consistently deliver each drink. Suddenly, the person in front of you orders a soy, vanilla, no sugar, quadruple-shot iced brew. Everyone in line sighs and starts reading their social media. In two minutes the line is out of the door.

Processing requests in web applications is analogous to our coffee example. In a coffee shop, we enlist the help of a new barista to simultaneously make coffees on a different machine to keep the line length in control and serve customers quickly. In software, to make applications responsive, we need to somehow process requests in our server in an overlapping manner, handling requests concurrently.

In the good old days of computing, each CPU was only able to execute a single machine instruction at any instant. If our server application runs on such a CPU, why do we need to structure our software systems to potentially execute multiple instructions concurrently? It all seems slightly pointless.

There is actually a very good reason. Virtually every program does more than just execute machine instructions. For example, when a program attempts to read from a file or send a message on the network, it must interact with the hardware subsystem (disk, network card) that is peripheral to the CPU. Reading data from a magnetic hard disk takes around 10 milliseconds (ms). During this time, the program must wait for the data to be available for processing.

Now, even an ancient CPU such as a 1988 Intel 80386 can execute more than 10 million instructions per second (mips). 10 ms is one hundredth of a second. How many instructions could our 80386 execute in a hundredth of a second? Do the math. (Hint—it’s a lot!) A lot of wasted processing capacity, in fact.
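
For those checking the math: at 10 million instructions per second, a 10 ms wait corresponds to 10,000,000 × 0.01 = 100,000 instructions the CPU could have executed while a single disk read completes.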

This is how operating systems such as Linux can run multiple programs on a single CPU. While one program is waiting for an I/O event, the operating system schedules another program to execute. By explicitly structuring our software to have multiple activities that can be executed in parallel, the operating system can schedule tasks that have work to do while others wait for I/O. We’ll see in more detail how this works with Java later in this chapter.

In 2001, IBM introduced the world’s first multicore processor, a chip with two CPUs—see Figure 4-1 for a simplified illustration. Today, even my laptop has 16 CPUs, or “cores,” as they are commonly known. With a multicore chip, a software system that is structured to have multiple parallel activities can be executed concurrently on each core, up to the number of available cores. In this way, we can fully utilize the processing resources on a multicore chip, and thus increase our application’s processing capacity.

Figure 4-1. Simplified view of a multicore processor

The primary way to structure a software system as concurrent activities is to use threads. Virtually every programming language has its own threading mechanism. The underlying semantics of all these mechanisms are similar—there are only a few primary threading models in mainstream use—but obviously the syntax varies by language. In the following sections, I’ll explain how threads are supported in Java, and how we need to design our programs to be safe (i.e., correct) and efficient when executing in parallel. Armed with this knowledge, leaping into the concurrency features supported in other languages shouldn’t be too arduous.

Threads

Every software process has a single thread of execution by default. This is the thread that the operating system manages when it schedules the process for execution. In Java, for example, the main() function you specify as the entry point to your code defines the behavior of this thread. This single thread has access to the program's environment and resources such as open file handles and network connections. As the program calls methods in objects instantiated in the code, the program's runtime stack is used to pass parameters and manage variable scopes. Standard programming language runtime stuff that we all know and love. This is a sequential process.

In your systems, you can use programming language features to create and execute additional threads. Each thread is an independent sequence of execution and has its own runtime stack to manage local object creation and method calls. Each thread also has access to the process’ global data and environment. A simple depiction of this scheme is shown in Figure 4-2.

Figure 4-2. Comparing single-threaded and multithreaded processes

In Java, we can define a thread using a class that implements the Runnable interface and defines the run() method. Let’s look at a simple example:

class NamingThread implements Runnable {

  private String name;

  public NamingThread(String threadName) {
    name = threadName;
    System.out.println("Constructor called: " + threadName);
  }

  public void run() {
    // Display info about this thread
    System.out.println("Run called : " + name);
    System.out.println(name + " : " + Thread.currentThread());
    // and now terminate ....
  }
}

To execute the thread, we need to construct a Thread object using an instance of our Runnable and call the start() method to invoke the code in its own execution context. This is shown in the next code example, along with the output of running the code in bold text. Note this example has two threads: the main() thread and the NamingThread. The main thread starts the NamingThread, which executes asynchronously, and then waits for 1 second to give our run() method in NamingThread ample time to complete:

public static void main(String[] args) {

  NamingThread name0 = new NamingThread("My first thread");

  // Create the thread
  Thread t0 = new Thread(name0);

  // start the thread
  t0.start();

  // delay the main thread for a second (1000 milliseconds)
  try {
    Thread.currentThread().sleep(1000);
  } catch (InterruptedException e) {}

  // Display info about the main thread and terminate
  System.out.println(Thread.currentThread());
}

===EXECUTION OUTPUT===
Constructor called: My first thread
Run called : My first thread
My first thread : Thread[Thread-0,5,main]
Thread[main,5,main]

For illustration, we also call the static currentThread() method, which returns a reference to the currently executing Thread object; printing this reference produces a string containing:

  • The system-generated thread identifier.

  • The thread priority, which by default is 5 for all threads. We’ll cover thread priorities later.

  • The identifier of the parent thread—in this example both parent threads are the main thread.

Note that to execute the thread, we call the start() method, not the run() method we define in the Runnable. The start() method contains the internal system magic to create the execution context for a separate thread to execute. If we call run() directly, the code will execute, but no new thread will be created. The run() method will execute as part of the main thread, just like any other Java method invocation that you know and love. You will still have single-threaded code.
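
To make the distinction concrete, here is a short illustration of my own, reusing the NamingThread class from above:

NamingThread naming = new NamingThread("another thread");
Thread t1 = new Thread(naming);

t1.run();   // wrong: run() executes synchronously on the calling thread
t1.start(); // right: a new thread is created and invokes run() concurrently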

In the example, we use sleep() to pause the execution of the main thread and make sure it does not terminate before the NamingThread. This approach, namely coordinating two threads by delaying for an absolute time period (1 second in the example), is not a very robust mechanism. What if for some reason—a slower CPU, a long delay reading disk, additional complex logic in the method—our thread doesn’t terminate in the expected timeframe? In this case, main will terminate first—this is not what we intend. In general, if you are using absolute times for thread coordination, you are doing it wrong. Almost always. Like 99.99999% of the time.

A simple and robust mechanism for one thread to wait until another has completed its work is to use the join() method. We could replace the try-catch block in the above example with:

t0.join();

This method causes the calling thread (in this case, main) to block until the thread referenced by t0 terminates. If the referenced thread has terminated before the call to join(), then the method call returns immediately. In this way we can coordinate, or synchronize, the behavior of multiple threads. Synchronization of multiple threads is in fact the major focus of the rest of this chapter.

Order of Thread Execution

The system scheduler (in Java, this lives in the Java virtual machine [JVM]) controls the order of thread execution. From the programmer’s perspective, the order of execution is nondeterministic. Get used to that term, I’ll use it a lot. The concept of nondeterminism is fundamental to understanding multithreaded code.

I’ll illustrate this by building on the earlier NamingThread example. Instead of creating a single NamingThread, I’ll create and start up a few. Three, in fact, as shown in the following code example. Again, sample output from running the code is in bold text beneath the code itself:

NamingThread name0 = new NamingThread("thread0");
NamingThread name1 = new NamingThread("thread1");
NamingThread name2 = new NamingThread("thread2");

// Create the threads
Thread t0 = new Thread(name0);
Thread t1 = new Thread(name1);
Thread t2 = new Thread(name2);

// start the threads
t0.start();  1
t1.start();  1
t2.start();  1

===EXECUTION OUTPUT===
Run called : thread0
thread0 : Thread[Thread-0,5,main]  2
Run called : thread2  3
Run called : thread1
thread1 : Thread[Thread-1,5,main]  4
thread2 : Thread[Thread-2,5,main]
Thread[main,5,main]

The output shown is a sample from just one execution. You can see the code starts three threads sequentially, namely t0, t1, and t2 (see 1). Looking at the output, we see thread t0 completes (see 2) before the others start. Next t2’s run() method is called (see 3), followed by t1’s run() method, even though t1 was started before t2. Thread t1 then runs to completion (see 4) before t2, and eventually the main thread and the program terminate.

This is just one possible order of execution. If we run this program again, we will almost certainly see a different execution trace. This is because the JVM scheduler is deciding which thread to execute, and for how long. Put very simply, once the scheduler has given a thread an execution time slot on a CPU, it can interrupt the thread after a specified time period and schedule another one to run. This interruption is known as preemption. Preemption ensures each thread is given an opportunity to make progress. Hence the threads run independently and asynchronously until completion, and the scheduler decides which thread runs when based on a scheduling algorithm.

There’s more to thread scheduling than this, and I’ll explain the basic scheduling algorithm used later in this chapter. For now, there is a major implication for programmers: regardless of the order of thread execution—which you don’t control—your code should produce correct results. Sounds easy? Read on.

Problems with Threads

The basic problem in concurrent programming is coordinating the execution of multiple threads so that whatever order they are executed in, they produce the correct answer. Given that threads can be started and preempted nondeterministically, any moderately complex program will have essentially an infinite number of possible orders of execution. These systems aren’t easy to test.

There are two fundamental problems that all concurrent programs need to avoid. These are race conditions and deadlocks, and these topics are covered in the next two subsections.

Race Conditions

Nondeterministic execution of threads implies that the code statements that comprise the threads:

  • Will execute sequentially as defined within each thread.

  • Can be overlapped in any order across threads. This is because the number of statements that are executed for each thread execution slot is determined by the scheduler.

Hence, when many threads are executed on a single processor, their execution is interleaved. The CPU executes some steps from one thread, then performs some steps from another, and so on. If we are executing on a multicore CPU, then we can execute one thread per core. The statements of each thread are still, however, interleaved in a nondeterministic manner.

Now, if every thread simply does its own thing and is completely independent, this is not a problem. Each thread executes until it terminates, as in our trivial Naming​Thread example. This stuff is a piece of cake! Why are these thread things meant to be complex?

Unfortunately, totally independent threads are not how most multithreaded systems behave. If you refer back to Figure 4-2, you will see that multiple threads share the global data within a process. In Java this is both global and static data.

Threads can use shared data structures to coordinate their work and communicate status across threads. For example, we may have threads handling requests from web clients, one thread per request. We also want to keep a running total of how many requests we process each day. When a thread completes a request, it increments a global RequestCounter object that all threads share and update after each request. At the end of the day, we know how many requests were processed. A simple and elegant solution indeed. Well, maybe?

The code below shows a very simple implementation that mimics the request counter example scenario. It creates 50,000 threads to update a shared counter. Note we use a lambda function for brevity to create the threads, and a (really bad idea) 5-second delay in main to allow the threads to finish:

public class RequestCounter {

  final static private int NUMTHREADS = 50000;
  private int count = 0;

  public void inc() {
    count++;
  }

  public int getVal() {
    return this.count;
  }

  public static void main(String[] args) throws InterruptedException {
    final RequestCounter counter = new RequestCounter();

    for (int i = 0; i < NUMTHREADS; i++) {
      // lambda runnable creation
      Runnable thread = () -> { counter.inc(); };
      new Thread(thread).start();
    }

    Thread.sleep(5000);
    System.out.println("Value should be " + NUMTHREADS + ". It is: " + counter.getVal());
  }
}

What you can do at home is clone this code from the book’s GitHub repo, run this code a few times, and see what results you get. In 10 executions my mean was 49,995. I didn’t once get the correct answer of 50,000. Weird.

Why?

The answer lies in how abstract, high-level programming language statements, in Java in this case, are executed on a machine. In this example, to perform an increment of a counter, the CPU must:

  • Load the current value into a register.

  • Increment the register value.

  • Write the results back to the original memory location.

This simple increment is actually a sequence of three machine-level operations.

As Figure 4-3 shows, at the machine level these three operations are independent and not treated as a single atomic operation. By atomic, we mean an operation that cannot be interrupted and hence once started will run to completion.

As the increment operation is not atomic at the machine level, one thread can load the counter value into a CPU register from memory, but before it writes the incremented value back, the scheduler preempts the thread and allows another thread to start. This thread loads the old value of the counter from memory and writes back the incremented value. Eventually the original thread executes again and writes back its incremented value, which just happens to be the same as what is already in memory.

This means we’ve lost an update. From our 10 tests of the counter code above, we see this is happening on average 5 times in 50,000 increments. Hence such events are rare, but even if it happens 1 time in 10 million, you still have an incorrect result.

Figure 4-3. Increments are not atomic at the machine level

When we lose updates in this manner, it is called a race condition. Race conditions can occur whenever multiple threads make changes to some shared state, in this case a simple counter. Essentially, different interleavings of the threads can produce different results.

Race conditions are insidious, evil errors, because their occurrence is typically rare, and they can be hard to detect as most of the time the answer will be correct. Try running the multithreaded counter code example with 1,000 threads instead of 50,000, and you will see this in action. I got the correct answer nine times out of ten.

So, this situation can be summarized as “same code, occasionally different results.” Like I said, race conditions are evil! Luckily, eradicating them is straightforward if you take a few precautions.

The key is to identify and protect critical sections. A critical section is a section of code that updates shared data structures and hence must be executed atomically if accessed by multiple threads. The example of incrementing a shared counter is an example of a critical section. Another is removing an item from a list. We need to delete the head node of the list and move the reference to the head of the list from the removed node to the next node in the list. Both operations must be performed atomically to maintain the integrity of the list. This is a critical section.

In Java, the synchronized keyword defines a critical section. If used to decorate a method, then when multiple threads attempt to call that method on the same shared object, only one is permitted to enter the critical section. All others block until the thread exits the synchronized method, at which point the scheduler chooses the next thread to execute the critical section. We say the execution of the critical section is serialized, as only one thread at a time can be executing the code inside it.

To fix the counter example, you therefore just need to identify the inc() method as a critical section and make it a synchronized method, i.e.:

synchronized public void inc() {
  count++;
}

Test it out as many times as you like. You’ll always get the correct answer. Slightly more formally, this means any interleaving of the threads that the scheduler throws at us will always produce the correct results.

synchronized关键字还可以应用于方法内的语句块。例如,我们可以将上面的例子重写为:

The synchronized keyword can also be applied to blocks of statements within a method. For example, we could rewrite the above example as:

public void inc() {
  synchronized(this) {
    count++;
  }
}
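
Returning to the earlier list example, the same technique protects a multistep critical section. Here is a minimal sketch of my own (not code from the book's repository) of a linked list whose head insertion and removal are serialized:

public class SimpleList {

  private static class Node {
    Object item;
    Node next;
  }

  private Node head;

  public synchronized void addHead(Object item) {
    Node n = new Node();
    n.item = item;
    n.next = head;   // two steps that must not interleave with removeHead()
    head = n;
  }

  // Critical section: reading the head and advancing the head reference
  // must happen atomically, or two threads could remove the same node
  public synchronized Object removeHead() {
    if (head == null) return null;
    Node removed = head;
    head = removed.next;
    return removed.item;
  }
}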

Underneath the covers, every Java object has a monitor lock, sometimes known as an intrinsic lock, as part of its runtime representation. The monitor is like the bathroom on a long-distance bus—only one person is allowed to (and should!) enter at once, and the door lock stops others from entering when in use.

In our totally sanitary Java runtime environment, a thread must acquire the monitor lock to enter a synchronized method or synchronized block of statements. Only one thread can own the lock at any time, and hence execution is serialized. This, very basically, is how Java and similar languages implement critical sections.

As a rule of thumb, you should keep critical sections as small as possible so that the serialized code is minimized. This can have positive impacts on performance and hence scalability. I’ll return to this topic later, but I’m really talking about Amdahl’s law again, as introduced in Chapter 2. Synchronized blocks are the serialized parts of a system as described by Amdahl, and the longer they execute for, then the less potential we have for system scalability.
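
As an illustration (a sketch of my own, with hypothetical names), if each request involves expensive thread-local computation plus a shared-state update, only the update needs to sit inside the critical section:

public class StatsCollector {

  private long total = 0; // shared state

  public void handleRequest(int[] data) {
    // Thread-local work: summing a private array needs no lock
    long sum = 0;
    for (int d : data) sum += d;

    // Only the shared-state update is serialized
    synchronized (this) {
      total += sum;
    }
  }

  public synchronized long getTotal() {
    return total;
  }
}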

Deadlocks

To ensure correct results in multithreaded code, I explained that we have to restrict the inherent nondeterminism to serialize access to critical sections. This avoids race conditions. However, if we are not careful, we can write code that restricts nondeterminism so much that our program stops. And never continues. This is formally known as a deadlock.

A deadlock occurs when two or more threads are blocked forever, and none can proceed. This happens when threads need exclusive access to a shared set of resources and the threads acquire locks in different orders. This is illustrated in the example below in which two threads need exclusive access to critical sections A and B. Thread 1 acquires the lock for critical section A, and thread 2 acquires the lock for critical section B. Both then block forever as they cannot acquire the locks they need to continue.

Two threads sharing access to two shared variables via synchronized blocks:

  • Thread 1: enters critical section A.

  • Thread 2: enters critical section B.

  • Thread 1: blocks on entry to critical section B.

  • Thread 2: blocks on entry to critical section A.

  • Both threads wait forever.

A deadlock, also known as a deadly embrace, causes a program to stop. It doesn’t take a vivid imagination to realize that this can cause all sorts of undesirable outcomes. I’m happily texting away while my autonomous vehicle drives me to the bar. Suddenly, the vehicle code deadlocks. It won’t end well.

Deadlocks occur in more subtle circumstances than the simple example above. The classic example is the dining philosophers problem. The story goes like this.

Five philosophers sit around a shared table. Being philosophers, they spend a lot of time thinking deeply. In between bouts of deep thinking, they replenish their brain function by eating from a plate of food that sits in front of them. Hence a philosopher is either eating or thinking or transitioning between these two states.

In addition, the philosophers must all be physically very close, highly dexterous, and COVID-19 vaccinated friends, as they share chopsticks to eat. Only five chopsticks are on the table, placed between each philosopher. When one philosopher wishes to eat, they follow a protocol of picking up their left chopstick first, then their right chopstick. Once they are ready to think again, they first return the right chopstick, then the left.

Figure 4-4 depicts our philosophers, each identified by a unique number. As each is either concurrently eating or thinking, we can model each philosopher as a thread.

Figure 4-4. The dining philosophers problem

The code is shown in Example 4-1. The shared chopsticks are represented by instances of the Java Object class. As only one thread can hold the monitor lock on an object at any time, the chopsticks are used as entry conditions to the critical sections in which the philosophers acquire the chopsticks they need to eat. After eating, the chopsticks are returned to the table and the lock is released on each so that neighboring philosophers can eat whenever they are ready.

Example 4-1. The Philosopher thread
public class Philosopher implements Runnable {

  private final Object leftChopStick;
  private final Object rightChopStick;

  Philosopher(Object leftChopStick, Object rightChopStick) {
    this.leftChopStick = leftChopStick;
    this.rightChopStick = rightChopStick;
  }

  private void LogEvent(String event) throws InterruptedException {
    System.out.println(Thread.currentThread().getName() + " " + event);
    Thread.sleep(1000);
  }

  public void run() {
    try {
      while (true) {
        LogEvent(": Thinking deeply");
        synchronized (leftChopStick) {
          LogEvent(": Picked up left chopstick");
          synchronized (rightChopStick) {
            LogEvent(": Picked up right chopstick – eating");
            LogEvent(": Put down right chopstick");
          }
          LogEvent(": Put down left chopstick. Ate too much");
        }
      } // end while
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

To bring the philosophers described in Example 4-1 to life, we must instantiate a thread for each and give each philosopher access to their neighboring chopsticks. This is done through the Philosopher constructor call at 1 in Example 4-2. In the for loop we create five philosophers and start these as independent threads, where each chopstick is accessible to two threads, one as a left chopstick, and one as a right.

Example 4-2. Dining philosophers—deadlocked version
private final static int NUMCHOPSTICKS = 5;
private final static int NUMPHILOSOPHERS = 5;

public static void main(String[] args) throws Exception {

  final Philosopher[] ph = new Philosopher[NUMPHILOSOPHERS];
  Object[] chopSticks = new Object[NUMCHOPSTICKS];

  for (int i = 0; i < NUMCHOPSTICKS; i++) {
    chopSticks[i] = new Object();
  }

  for (int i = 0; i < NUMPHILOSOPHERS; i++) {
    Object leftChopStick = chopSticks[i];
    Object rightChopStick = chopSticks[(i + 1) % chopSticks.length];

    ph[i] = new Philosopher(leftChopStick, rightChopStick);  1

    Thread th = new Thread(ph[i], "Philosopher " + i);
    th.start();
  }
}

Running this code produces the following output on my first attempt. If you run the code you will almost certainly see different outputs, but the final outcome will be the same:

Philosopher 3 : Thinking deeply
Philosopher 4 : Thinking deeply
Philosopher 0 : Thinking deeply
Philosopher 1 : Thinking deeply
Philosopher 2 : Thinking deeply
Philosopher 3 : Picked up left chopstick
Philosopher 0 : Picked up left chopstick
Philosopher 2 : Picked up left chopstick
Philosopher 4 : Picked up left chopstick
Philosopher 1 : Picked up left chopstick

Ten lines of output, then…nothing! We have a deadlock. This is a classic circular waiting deadlock. Imagine the following scenario:

  • Each philosopher indulges in a long thinking session.

  • Simultaneously, they all decide they are hungry and reach for their left chopstick.

  • No philosopher can eat (proceed) as none can pick up their right chopstick.

Real philosophers in this situation would figure out some way to proceed by putting down a chopstick or two until one or more of their colleagues can eat. We can sometimes do this in our software by using timeouts on blocking operations. When the timeout expires, a thread releases the critical section and retries, allowing other blocked threads a chance to proceed. This is not optimal though, as blocked threads hurt performance and setting timeout values is an inexact science.
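
One way to sketch the timeout approach in Java is with java.util.concurrent.locks.ReentrantLock, whose tryLock() method accepts a timeout. This replaces the intrinsic monitor locks of Example 4-1, and the timeout and backoff values below are arbitrary:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimeoutPhilosopher implements Runnable {

  private final ReentrantLock leftChopStick;
  private final ReentrantLock rightChopStick;

  TimeoutPhilosopher(ReentrantLock left, ReentrantLock right) {
    this.leftChopStick = left;
    this.rightChopStick = right;
  }

  public void run() {
    try {
      while (true) {
        if (leftChopStick.tryLock(100, TimeUnit.MILLISECONDS)) {
          try {
            if (rightChopStick.tryLock(100, TimeUnit.MILLISECONDS)) {
              try {
                // both chopsticks held: eat
              } finally {
                rightChopStick.unlock();
              }
            }
          } finally {
            leftChopStick.unlock();
          }
        }
        // could not hold both locks: back off briefly, then retry
        Thread.sleep((long) (Math.random() * 50));
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}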

It is much better, therefore, to design a solution to be deadlock-free. This means that one or more threads will always be able to make progress. With circular wait deadlocks, this can be achieved by imposing a resource allocation protocol on the shared resources, so that threads will not always request resources in the same order.

In the dining philosophers problem, we can do this by making sure one of our philosophers picks up their right chopstick first. Let’s assume we instruct Philosopher 4 to do this. This leads to a possible sequence of operations such as below:

  • Philosopher 0 picks up left chopstick (chopStick[0]) then right (chopStick[1])

  • Philosopher 1 picks up left chopstick (chopStick[1]) then right (chopStick[2])

  • Philosopher 2 picks up left chopstick (chopStick[2]) then right (chopStick[3])

  • Philosopher 3 picks up left chopstick (chopStick[3]) then right (chopStick[4])

  • Philosopher 4 picks up right chopstick (chopStick[0]) then left (chopStick[4])

In this example, Philosopher 4 must block, as Philosopher 0 already has acquired access to chopstick[0]. With Philosopher 4 blocked, Philosopher 3 is assured access to chopstick[4] and can then proceed to satisfy their appetite.

The fix for the dining philosophers solution is shown in Example 4-3.

Example 4-3. Solving the dining philosophers deadlock
if (i == NUMPHILOSOPHERS - 1) {
  // The last philosopher picks up the right chopstick first
  ph[i] = new Philosopher(rightChopStick, leftChopStick); 
} else {
  // all others pick up the left chopstick first 
  ph[i] = new Philosopher(leftChopStick, rightChopStick);
}

More formally we are imposing an ordering on the acquisition of shared resources, such that:

chopStick[0] < chopStick[1] < chopStick[2] < chopStick[3] < chopStick[4]

This means each thread will always attempt to acquire chopstick[0] before chopstick[1], and chopstick[1] before chopstick[2], and so on. For Philosopher 4, this means they will attempt to acquire chopstick[0] before chopstick[4], thus breaking the potential for a circular wait deadlock.

Deadlocks are a complicated topic and this section has just scratched the surface. You’ll see deadlocks in many distributed systems. For example, a user request acquires a lock on some data in a Students database table, and must then update rows in the Classes table to reflect student attendance. Simultaneously another user request acquires locks on the Classes table, and next must update some information in the Students table. If these requests interleave such that each request acquires locks in an overlapping fashion, we have a deadlock.

Thread States

Multithreaded systems have a system scheduler that decides which threads to run when. In Java, the scheduler is known as a preemptive, priority-based scheduler. In short, this means it chooses to execute the highest priority thread which wishes to run.

Every thread has a priority (by default 5, range 1 to 10). A thread inherits its priority from its parent thread. Higher priority threads get scheduled more frequently than lower priority threads, but in most applications having all threads at the default priority suffices.
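
For reference, a priority can be set explicitly through the standard Thread API; the task variable here stands for any Runnable:

Thread t = new Thread(task);
t.setPriority(Thread.MAX_PRIORITY); // 10; the default is Thread.NORM_PRIORITY (5)
t.start();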

The scheduler cycles threads through four distinct states, based on their behavior. These are:

Created
A thread object has been created but its start() method has not been invoked. Once start() is invoked, the thread enters the runnable state.
Runnable
A thread is able to run. The scheduler will choose which thread(s) to execute in a first-in, first-out (FIFO) manner—one thread can be allocated at any time to each core in the node. Threads then execute until they block (e.g., on a synchronized statement), execute a yield(), suspend(), or sleep() statement, the run() method terminates, or they are preempted by the scheduler. Preemption occurs when a higher priority thread becomes runnable, or when a system-specific time period, known as a time slice, expires. Preemption based on time slicing allows the scheduler to ensure that all threads eventually get a chance to execute—no execution-hungry threads can hog the CPU.
Blocked
A thread is blocked if it is waiting for a lock, a notification event to occur (e.g., sleep timer to expire, resume() method executed), or is waiting for a network or disk request to complete. When the specific event a blocked thread is waiting for occurs, it moves back to the runnable state.
Terminated
A thread’s run() method has completed or it has called the stop() method. The thread will no longer be scheduled.

An illustration of this scheme is in Figure 4-5. The scheduler effectively maintains a FIFO queue in the runnable state for each thread priority. High-priority threads are typically used to respond to events (e.g., an emergency timer) and execute for a short period of time. Low-priority threads are used for background, ongoing tasks like checking for corruption of files on disk through recalculating checksums. Background threads basically use up idle CPU cycles.

Figure 4-5. Thread states and transitions

Thread Coordination

There are many problems that require threads with different roles to coordinate their activities. Imagine a collection of threads that each accept documents from users, do some processing on the documents (e.g., generate a PDF), and then send the processed document to a shared printer pool. Each printer can only print one document at a time, so they read from a shared print queue, grabbing and printing documents in the order they arrive.

This printing problem is an illustration of the classic producer-consumer problem. Producers generate and send messages via a shared FIFO buffer to consumers. Consumers retrieve these messages, process them, and then ask for more work from the buffer. A simple illustration of this problem is shown in Figure 4-6. It’s a bit like a 24-hour, 365-day buffet restaurant—the kitchen keeps producing, the waitstaff collect the food and put it in the buffet, and hungry diners help themselves. Forever.

Figure 4-6. The producer-consumer problem

Like virtually all real resources, the buffer has a limited capacity. Producers generate new items, but if the buffer is full, they must wait until some item(s) have been consumed before they can add the new item to the buffer. Similarly, if the consumers are consuming faster than the producers are producing, then they must wait if there are no items in the buffer, and somehow get alerted when new items arrive.

One way for a producer to wait for space in the buffer, or a consumer to wait for an item, is to keep retrying an operation. A producer could sleep for a second, and then retry the put operation until it succeeds. A consumer could do likewise.

This solution is called polling, or busy waiting. It works fine, but as the second name implies, each producer and consumer are using resources (CPU, memory, maybe network?) each time they retry and fail. If this is not a concern, then cool, but in scalable systems we are always aiming to optimize resource usage, and polling can be wasteful.
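
A busy-waiting producer might look like the following sketch, assuming the buffer exposes a nonblocking offer() method that returns false when the buffer is full, as Java's queue interfaces do:

// Busy-waiting producer: burns resources on every failed attempt
while (!buffer.offer(item)) {   // offer() returns false if the buffer is full
  Thread.sleep(1000);           // back off for a second, then poll again
}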

A better solution is for producers and consumers to block until their desired operation, put or get respectively, can succeed. Blocked threads consume no resources and hence provide an efficient solution. To facilitate this, thread programming models provide blocking operations that enable threads to signal to other threads when an event occurs. With the producer-consumer problem, the basic scheme is as follows:

  • When a producer adds an item to the buffer, it sends a signal to any blocked consumers to notify them that there is an item in the buffer.

  • When a consumer retrieves an item from the buffer, it sends a signal to any blocked producers to notify them there is capacity in the buffer for new items.

In Java, there are two basic primitives, namely wait() and notify(), that can be used to implement this signaling scheme. Briefly, they work like this:

  • A thread may call wait() within a synchronized block if some condition it requires to hold is not true. For example, a thread may attempt to retrieve a message from a buffer, but if the buffer has no messages to retrieve, it calls wait() and blocks until another thread adds a message, sets the condition to true, and calls notify() on the same object.

  • notify() wakes up a thread that has called wait() on the object.

These Java primitives are used to implement guarded blocks. Guarded blocks use a condition as a guard that must hold before a thread resumes execution. The following code snippet shows how the guard condition, empty, is used to block a thread that is attempting to retrieve a message from an empty buffer:

while (empty) {
  try {
    System.out.println("Waiting for a message");
    wait();
  } catch (InterruptedException e) {}
}

When another thread adds a message to the buffer, it executes notify() as follows:

// Store message.
this.message = message;
empty = false;
// Notify consumer that message is available
notify();
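
To give a flavor of how these fragments fit together, here is a minimal single-message buffer along the same lines (a sketch only; the repository version may differ). It uses notifyAll(), a variation of notify() that wakes all waiting threads:

public class MessageBuffer {

  private String message;
  private boolean empty = true;

  public synchronized String take() throws InterruptedException {
    while (empty) {
      wait(); // block until a producer stores a message
    }
    empty = true;
    notifyAll(); // wake any blocked producers
    return message;
  }

  public synchronized void put(String message) throws InterruptedException {
    while (!empty) {
      wait(); // block until the consumer empties the buffer
    }
    this.message = message;
    empty = false;
    notifyAll(); // wake any blocked consumers
  }
}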

The full implementation of this example is given in the code examples in the book's Git repository. There are a number of variations of the wait() and notify() methods, but these go beyond the scope of what I can cover in this overview. And luckily, Java provides us with thread-safe abstractions that hide this complexity from your code.

An example that is pertinent to the producer-consumer problem is the BlockingQueue interface in the java.util.concurrent package. A BlockingQueue implementation provides a thread-safe object that can be used as the buffer in a producer-consumer scenario. There are 5 different implementations of the BlockingQueue interface. I'll use one of these, the LinkedBlockingQueue, to implement the producer-consumer. This is shown in Example 4-4.

Example 4-4. Producer-consumer with a LinkedBlockingQueue
class ProducerConsumer {
  public static void main(String[] args) {
    BlockingQueue buffer = new LinkedBlockingQueue();
    Producer p = new Producer(buffer);
    Consumer c = new Consumer(buffer);
    new Thread(p).start();
    new Thread(c).start();
  }
}

class Producer implements Runnable {
  private boolean active = true;
  private final BlockingQueue buffer;

  public Producer(BlockingQueue q) { buffer = q; }

  public void run() {
    try {
      while (active) { buffer.put(produce()); }
    } catch (InterruptedException ex) { /* handle exception */ }
  }

  Object produce() { /* details omitted, sets active=false */ }
}

class Consumer implements Runnable {
  private boolean active = true;
  private final BlockingQueue buffer;

  public Consumer(BlockingQueue q) { buffer = q; }

  public void run() {
    try {
      while (active) { consume(buffer.take()); }
    } catch (InterruptedException ex) { /* handle exception */ }
  }

  void consume(Object x) { /* details omitted, sets active=false */ }
}

This solution absolves the programmer from being concerned with the implementation of coordinating access to the shared buffer, and greatly simplifies the code.

The java.util.concurrent package is a treasure trove for building multithreaded Java solutions. In the following sections, I will briefly highlight a few of these powerful and extremely useful capabilities.
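
As one small taste, the java.util.concurrent.atomic classes provide lock-free atomic operations, so the shared counter from earlier in this chapter could be written without synchronized at all:

import java.util.concurrent.atomic.AtomicInteger;

public class AtomicRequestCounter {

  private final AtomicInteger count = new AtomicInteger(0);

  public void inc() {
    count.incrementAndGet(); // atomic read-modify-write, so no lost updates
  }

  public int getVal() {
    return count.get();
  }
}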

Thread Pools

Many multithreaded systems need to create and manage a collection of threads that perform similar tasks. For example, in the producer-consumer problem, we can have a collection of producer threads and a collection of consumer threads, all simultaneously adding and removing items, with coordinated access to the shared buffer.

These collections are known as thread pools. Thread pools comprise several worker threads, which typically perform a similar purpose and are managed as a collection. We could create a pool of producer threads which all wait for an item to process, write the final product to the buffer, and then wait to accept another item to process. When we stop producing items, the pool can be shut down in a safe manner, so no partially processed items are lost through an unanticipated exception.

java.util.concurrent包中,线程接口支持池ExecutorService。这通过一组方法扩展了基本Executor接口来管理和终止池中的线程。示例4-54-6中显示了一个使用固定大小线程池的简单生产者-消费者示例。示例 4-5Producer中的类向缓冲区发送一条消息,然后终止。它只是从缓冲区中获取消息,直到收到空字符串为止,然后终止。RunnableConsumer

In the java.util.concurrent package, thread pools are supported by the ExecutorService interface. This extends the base Executor interface with a set of methods to manage and terminate threads in the pool. A simple producer-consumer example using a fixed size thread pool is shown in Examples 4-5 and 4-6. The Producer class in Example 4-5 is a Runnable that sends a single message to the buffer and then terminates. The Consumer simply takes messages from the buffer until an empty string is received, upon which it terminates.

Example 4-5. Producer and consumer implementations for the thread pool
class Producer implements Runnable {

  private final BlockingQueue buffer;

  public Producer(BlockingQueue q) { buffer = q; }

  @Override
  public void run() {
    try {
      Thread.sleep(1000);
      buffer.put("hello world");
    } catch (InterruptedException ex) {
      // handle exception
    }
  }
}

class Consumer implements Runnable {

  private final BlockingQueue buffer;

  public Consumer(BlockingQueue q) { buffer = q; }

  @Override
  public void run() {
    boolean active = true;
    while (active) {
      try {
        String s = (String) buffer.take();
        System.out.println(s);
        if (s.equals("")) active = false;
      } catch (InterruptedException ex) {
        // handle exception
      }
    }
    System.out.println("Consumer terminating");
  }
}

In Example 4-6, we create a single consumer to take messages from the buffer. We then create a fixed size thread pool of size 5 to manage our producers. This causes the JVM to preallocate five threads that can be used to execute any Runnable objects that are executed by the pool.

In the for() loop, we then use the ExecutorService to run 20 producers. As there are only 5 threads available in the thread pool, only a maximum of 5 producers will be executed simultaneously. All others are placed in a wait queue which is managed by the thread pool. When a producer terminates, the next Runnable in the wait queue is executed using any available thread in the pool.

Once we have requested all the producers to be executed by the thread pool, we call the shutdown() method on the pool. This tells the ExecutorService not to accept any more tasks to run. We next call the awaitTermination() method, which blocks the calling thread until all the threads managed by the thread pool are idle and no more work is waiting in the wait queue. Once awaitTermination() returns, we know all messages have been sent to the buffer, and hence send an empty string to the buffer which will act as a termination value for the consumer.

Example 4-6. Producer-consumer solution based on a thread pool
import java.util.concurrent.*;

public class ProducerConsumerExample {

  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<String> buffer = new LinkedBlockingQueue<>();

    // start a single consumer
    (new Thread(new Consumer(buffer))).start();

    ExecutorService producerPool = Executors.newFixedThreadPool(5);
    for (int i = 0; i < 20; i++) {
      Producer producer = new Producer(buffer);
      System.out.println("Producer created");
      producerPool.execute(producer);
    }

    producerPool.shutdown();
    producerPool.awaitTermination(10, TimeUnit.SECONDS);

    // send termination message to consumer
    buffer.put("");
  }
}

As with most topics in this chapter, there are many more sophisticated features in the Executor framework that can be used to create multithreaded programs. This description has just covered the basics. Thread pools are important as they enable our systems to rationalize the use of resources for threads. Every thread consumes memory; for example, the stack size for a thread is typically around 1 MB. Also, when we switch execution context to run a new thread, this consumes CPU cycles. If our systems create threads in an undisciplined manner, we will eventually run out of memory and the system will crash. Thread pools allow us to control the number of threads we create and utilize them efficiently.

I’ll discuss thread pools throughout the remainder of this book, as they are a key concept for efficient and scalable management of the ever-increasing request loads that servers must satisfy.

Barrier Synchronization

I had a high school friend whose family, at dinnertime, would not allow anyone to start eating until the whole family was seated at the table. I thought this was weird, but many years later it serves as a good analogy for the concept known as barrier synchronization. Eating commenced only after all family members arrived at the table.

Multithreaded systems often need to follow such a pattern of behavior. Imagine a multithreaded image-processing system. An image arrives and a distinct segment of the image is passed to each thread to perform some transformation upon—think Instagram filters on steroids. The image is only fully processed when all threads have completed. In software systems, we use a mechanism called barrier synchronization to achieve this style of thread coordination.

The general scheme is shown in Figure 4-7. In this example, the main() thread creates four new threads and all proceed independently until they reach the point of execution defined by the barrier. As each thread arrives, it blocks. When all threads have arrived at this point, the barrier is released, and each thread can continue with its processing.

Figure 4-7. Barrier synchronization

Java provides three primitives for barrier synchronization. I’ll show here how one of the three, CountDownLatch, works. The basic concepts apply to other barrier synchronization primitives.

When you create a CountDownLatch, you pass a value to its constructor that represents the number of threads that must block at the barrier before they are all allowed to continue. The constructor is called in the thread that manages the barrier point for the system—in Figure 4-7 this would be main():

CountDownLatch nextPhaseSignal = new CountDownLatch(numThreads);

Next, you create the worker threads that will perform some actions and then block at the barrier until they all complete. To do this, you need to pass each thread a reference to CountDownLatch:

for (int i = 0; i < numThreads; i++) {
    Thread worker = new Thread(new WorkerThread(nextPhaseSignal));
    worker.start();
}

After launching the worker threads, the main() thread will call the .await() method to block until the latch is triggered by the worker threads:

nextPhaseSignal.await();

Each worker thread will complete its task and, before exiting, call the .countDown() method on the latch. This decrements the latch value. When the last thread calls .countDown() and the latch value becomes zero, all threads that have called .await() on the latch transition from the blocked to the runnable state. At this stage we are assured that all workers have completed their assigned task:

nextPhaseSignal.countDown();

Any subsequent calls to .countDown() will return immediately as the latch has been effectively triggered. Note .countDown() is nonblocking, which is a useful property for applications in which threads have more work to do after reaching the barrier.

This example illustrates using a CountDownLatch to block a single thread until a collection of threads have completed their work. You can invert this use case with a latch, however, if you initialize its value to one. Multiple threads could call .await() and block until another thread calls .countDown() to release all waiting threads. This example is analogous to a simple gate, which one thread opens to allow a collection of others to continue.

CountDownLatch is a simple barrier synchronizer. It’s a single-use tool, as the initializer value cannot be reset. More sophisticated features are provided by the CyclicBarrier and Phaser classes in Java. Armed with the knowledge of how barrier synchronization works from this section, these will be straightforward to understand.
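
To tie the snippets above together, here is a minimal runnable sketch of the scheme in Figure 4-7. The WorkerThread body is a placeholder for whatever processing each thread performs before reaching the barrier:

import java.util.concurrent.CountDownLatch;

public class BarrierExample {

    static class WorkerThread implements Runnable {
        private final CountDownLatch latch;

        WorkerThread(CountDownLatch latch) { this.latch = latch; }

        @Override
        public void run() {
            // ... perform this thread's share of the work ...
            latch.countDown();  // signal arrival at the barrier (nonblocking)
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int numThreads = 4;
        CountDownLatch nextPhaseSignal = new CountDownLatch(numThreads);

        for (int i = 0; i < numThreads; i++) {
            new Thread(new WorkerThread(nextPhaseSignal)).start();
        }

        nextPhaseSignal.await();  // block until all workers have counted down
        System.out.println("All workers have reached the barrier");
    }
}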

Thread-Safe Collections

Many Java programmers, once they delve into the wonders of multithreaded programs, are surprised to discover that the collections in the java.util package are not thread safe.2 Why, I hear you ask? The answer, luckily, is simple. It has to do with performance. Calling synchronized methods incurs overheads. Hence, to attain faster execution for single-threaded programs, the collections are not thread safe.

If you want to share an ArrayList, Map, or your favorite data structure from java.util across multiple threads, you must ensure modifications to the structure are placed in critical sections. This approach places the burden on the client of the collection to safely make updates, and hence is error prone—a programmer might forget to make modifications in a synchronized block.

It’s always safer to use inherently thread-safe collections in your multithreaded code. For this reason, the Java collections framework provides a factory method that creates a thread-safe version of java.util collections. Here’s an example of creating a thread-safe list:

List<String> list = Collections.synchronizedList(new ArrayList<>());

What is really happening here is that you are creating a wrapper around the base collection class, which has synchronized methods. These delegate the actual work to the original class, in a thread-safe manner of course. You can use this approach for any collection in the java.util package, and the general form is:

Collections.synchronized....(new collection<>())

where “....” is List, Map, Set, and so on.
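
One caveat worth noting: the wrapper only synchronizes individual method calls. Compound actions, iteration being the classic case, still require the client to hold the collection's lock, as the javadoc for these wrappers points out. A minimal sketch:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

List<String> list = Collections.synchronizedList(new ArrayList<>());
list.add("powder day");  // individual operations are thread safe

// Iteration is a sequence of calls, so the client must hold the list's
// lock for the whole traversal to avoid interference from other threads
synchronized (list) {
    for (String s : list) {
        System.out.println(s);
    }
}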

Of course, when using the synchronized wrappers, you pay the performance penalty for acquiring the monitor lock and serializing access from multiple threads. This means the whole collection is locked while a single thread makes a modification, greatly limiting concurrent performance (Amdahl’s law again). For this reason, Java 5.0 included the concurrent collections package, namely java.util.concurrent. It contains a rich collection of classes specifically designed for efficient multithreaded access.

In fact, we’ve already seen one of these classes—the LinkedBlockingQueue. This uses a locking mechanism that enables items to be added to and removed from the queue in parallel. This finer-grained locking mechanism utilizes the java.util.concurrent.locks.Lock interface rather than the monitor lock approach. This allows multiple locks to be utilized on the same collection, hence enabling safe concurrent access.

Another extremely useful collection that provides this finer-grained locking is the ConcurrentHashMap. This provides similar methods to the non-thread-safe HashMap, but allows nonblocking reads and concurrent writes based on a concurrencyLevel value you can pass to the constructor (the default value is 16):

ConcurrentHashMap(int initialCapacity, float loadFactor,
                  int concurrencyLevel)

Internally, the hash table is divided into individually lockable segments, often known as shards. Locks are associated with each shard rather than the whole collection. This means updates can be made concurrently to hash table entries in different shards of the collection, increasing performance.
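
As a brief sketch of what this buys you, atomic per-entry operations such as merge() let many threads safely maintain shared state, for example hypothetical per-skier counters, with no external locking; only the shard holding the affected key is locked:

import java.util.concurrent.ConcurrentHashMap;

ConcurrentHashMap<String, Integer> liftRides = new ConcurrentHashMap<>();

// Safe to call concurrently from many threads: merge() atomically inserts
// the initial value or combines it with the existing entry for the key
liftRides.merge("768934", 1, Integer::sum);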

Retrieval operations are nonblocking for performance reasons, meaning they can overlap with multiple concurrent updates. This means retrievals only reflect the results of the most recently completed update operations at the time the retrieval is executed.

For similar reasons, iterators for a ConcurrentHashMap are what is known as weakly consistent. This means the iterator contains a copy of the hash map that reflects its state at the time the iterator is created. While the iterator is in use, new nodes may be added and existing nodes removed from the underlying hash map. However, these state changes are not reflected in the iterator.

If you need an iterator that always reflects the current hashmap state while being updated by multiple threads, then there are performance penalties to pay, and a ConcurrentHashMap is not the right approach. This is an example of favoring performance over consistency—a classic design trade-off.

Summary and Further Reading

I’ll draw upon the major concepts introduced in this chapter throughout the remainder of this book. Threads are inherently components of the data processing and database platforms that we use to build scalable distributed systems. In many cases, you may not be writing explicitly multithreaded code. However, the code you write will be invoked in a multithreaded environment, which means you need to be aware of thread safety. Many platforms also expose their concurrency through configuration parameters, meaning that to tune the system’s performance, you need to understand the effects of changing the various threading and thread pool settings. Basically, there’s no escaping concurrency in the world of scalable distributed systems.

Finally, it is worth mentioning that while concurrent programming primitives vary across programming languages, the foundational issues don’t change, and multithreaded code must be carefully designed to avoid race conditions and deadlocks. Whether you grapple with the pthreads library in C/C++ or the classic CSP-inspired Go concurrency model, the problems you need to avoid are the same. The knowledge you have gained from this chapter will set you on the right track, whatever language you are using.

This chapter has only brushed the surface of concurrency in general and its support in Java. The best book to continue learning more about the basic concepts of concurrency is the classic Java Concurrency in Practice (JCiP) by Brian Goetz et al. (Addison-Wesley Professional, 2006). If you understand everything in the book, you’ll be writing pretty great concurrent code.

Java concurrency support has moved on considerably since Java 5. In the world of Java 12 (or whatever version is current when you read this), there are new features such as CompletableFutures, lambda expressions, and parallel streams. The functional programming style introduced in Java 8.0 makes it easy to create concurrent solutions without directly creating and managing threads. A good source of knowledge for Java 8.0 features is Mastering Concurrency Programming with Java 8 by Javier Fernández González (Packt, 2017).

Other excellent sources include:

  • Doug Lea, Concurrent Programming in Java: Design Principles and Patterns, 2nd ed. (Addison-Wesley Professional, 1996)

  • Raoul-Gabriel Urma, Mario Fusco, and Alan Mycroft, Modern Java in Action: Lambdas, Streams, Functional and Reactive Programming (Manning, 2019)

  • The Baeldung website has a comprehensive collection of articles for learning about Java concurrency and served as the basis for the dining philosophers example in this chapter.

1 The correct way to handle these problems, namely barrier synchronization, is covered later in this chapter.

2 Except Vector and Hashtable, which are legacy classes; thread safe and slow!

Part II. Scalable Systems

The five chapters in Part II of this book focus on scaling request processing. The major topics covered include scaling out systems across multiple compute resources, load balancing, distributed caching, asynchronous messaging, and microservice-based architectures. I introduce the basic concepts of these architectural approaches and illustrate them with examples from widely used distributed technologies such as RabbitMQ and Google App Engine.

Chapter 5. Application Services

At the heart of any system lies the unique business logic that implements the application requirements. In distributed systems, this is exposed to clients through APIs and executed within a runtime environment designed to efficiently support concurrent remote calls. An API and its implementation comprise the fundamental elements of the services an application supports.

In this chapter, I’m going to focus on the pertinent issues for achieving scalability for the services tier in an application. I’ll explain APIs and service design and describe the salient features of application servers that provide the execution environment for services. I’ll also elaborate on topics such as horizontal scaling, load balancing, and state management that I introduced briefly in Chapter 2.

Service Design

In the simplest case, an application comprises one internet-facing service that persists data to a local data store, as shown in Figure 5-1. Clients interact with the service through its published API, which is accessible across the internet.

Figure 5-1. A simple service

Let’s look at the API and service implementation in more detail.

Application Programming Interface (API)

An API defines a contract between the client and server. The API specifies the types of requests that are possible, the data that is needed to accompany the requests, and the results that will be obtained. APIs have many different variations, as I explained in RPC/RMI discussions in Chapter 3. While there remains some API diversity in modern applications, the predominant style relies on HTTP APIs. These are typically, although not particularly accurately, classified as RESTful.

REST is an architectural style defined by Roy Fielding in his PhD thesis.1 A great source of knowledge on RESTful APIs and the various degrees to which web technologies can be exploited is REST in Practice by Jim Webber et al. (O’Reilly, 2010). Here I’ll just briefly touch on the HTTP create, read, update, delete (CRUD) API pattern. This pattern does not fully implement the principles of REST, but it is widely adopted in internet systems today. It exploits the four core HTTP verbs, namely POST, GET, PUT, and DELETE.

A CRUD API specifies how clients perform create, read, update, and delete operations in a specific business context. For example, a user might create a profile (POST), read catalog items (GET), update their shopping cart (PUT), and delete items from their order (DELETE).

An example HTTP CRUD API for the example ski resort system (briefly introduced in Chapter 2) that uses these four core HTTP verbs is shown in Table 5-1. In this example, parameter values are passed as part of the request address and are identified by the {} notation.

Table 5-1. HTTP CRUD verbs
Verb     Example URI                    Purpose
POST     /skico.com/skiers/             Create a new skier profile, with the skier details supplied in the JSON request payload. The new skier profile is returned in the JSON response.
GET      /skico.com/skiers/{skierID}    Get a skier's profile information, returned in the JSON response payload.
PUT      /skico.com/skiers/{skierID}    Update a skier's profile.
DELETE   /skico.com/skiers/{skierID}    Delete a skier's profile because they didn't renew their pass!

Additional parameter values can be passed and returned in HTTP request and response bodies, respectively. For example, a successful request to:

GET /skico.com/skiers/12345

will return an HTTP 200 response code and the following results formatted in JSON:

{
    "username": "Ian123",
    "email": "i.gorton@somewhere.com",
    "city": "Seattle"
}

To change the skier’s city, the client could issue the following PUT request to the same URI along with a request body representing the updated skier profile:

PUT /skico.com/skiers/12345
{
    "username": "Ian123",
    "email": "i.gorton@somewhere.com",
    "city": "Wenatchee"
}

More formally, an HTTP CRUD API applies HTTP verbs on resources identified by URIs. In Table 5-1, for example, a URI that identifies skier 768934 would be:

/skico.com/skiers/768934

An HTTP GET request to this resource would return the complete profile information for a skier in the response payload, such as name, address, number of days visited, and so on. If a client subsequently sends an HTTP PUT request to this URI, we are expressing the intent to update the resource for skier 768934—in this example it would be the skier’s profile. The PUT request would provide the complete representation for the skier’s profile as returned by the GET request. Again, this would be as a payload with the request. Payloads are typically formatted as JSON, although XML and other formats are also possible. If a client sends a DELETE request to the same URI, then the skier’s profile will be deleted.

Hence the combination of the HTTP verb and URI defines the semantics of the API operation. Resources, represented by URIs, are conceptually like objects in object-oriented design (OOD) or entities in an entity-relationship (ER) model. Resource identification and modeling hence follow similar methods to OOD and ER modeling. The focus, however, is on resources that need to be exposed to clients in the API. “Summary and Further Reading” points to useful sources of information for resource design.

HTTP APIs can be specified using a notation called OpenAPI. At the time of writing, the latest version is 3.0. A tool called SwaggerHub is the de facto standard to specify APIs in OpenAPI. The specification is defined in Yet Another Markup Language (YAML), and an example is shown in the following API definition extract. It defines the GET operation on the URI /resorts. If the operation is successful, a 200 response code is returned along with a list of resorts in a format defined by a JSON schema that appears later in the specification. If for some reason the query to get a list of resorts operated by skico.com returns no entries, a 404 response code is returned along with an error message that is also defined by a JSON schema:

paths:
  /resorts:
    get:
      tags:
        - resorts
      summary: get a list of ski resorts in the database
      operationId: getResorts
      responses:
        '200':
          description: successful operation
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ResortsList'
        '404':
          description: Resorts not found. Unlikely unless we go broke
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/responseMsg'

API design is a complex topic in itself and delving deeply into this area is beyond the scope of this book. From a scalability perspective, there are some issues that should, however, be borne in mind:

  • Each API request requires a round trip to a service, which incurs network latency. A common antipattern is known as a chatty API, in which multiple API requests are used to perform one logical operation. This commonly occurs when an API is designed following pure object-oriented design approaches. Imagine exposing get() and set() methods for individual resource properties as HTTP APIs. Accessing a resource would require multiple API requests, one for each property. This is not scalable. Use GET to retrieve the whole resource and PUT to send back an updated resource. You can also use the HTTP PATCH verb to update individual properties of a resource. PATCH allows partial modification of a resource representation, in contrast to PUT that replaces the complete resource representation with new values.

  • Consider using compression for HTTP APIs that pass large payloads. All modern web servers and browsers support compressed content using the HTTP Accept-Encoding and Content-Encoding headers. Specific API requests and responses can utilize these headers by specifying the compression algorithm that is used for the content—for example, gzip. Compression can reduce network bandwidth and latencies by 50% or more. The trade-off cost is the compute cycles to compress and decompress the content. This is typically small compared to the savings in network transit times.
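
To illustrate the second point, the following is a minimal client-side sketch, assuming a hypothetical endpoint that supports compressed responses. The client advertises gzip support in the Accept-Encoding header and decompresses the body only if the server's Content-Encoding header indicates compression was applied:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class CompressedGet {
    public static void main(String[] args) throws Exception {
        // hypothetical resource URI, for illustration only
        URL url = new URL("http://skico.com/skiers/768934");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // tell the server this client can handle a gzip-compressed body
        conn.setRequestProperty("Accept-Encoding", "gzip");

        InputStream body = conn.getInputStream();
        // the server signals compression via the Content-Encoding header
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            body = new GZIPInputStream(body);
        }
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(body))) {
            reader.lines().forEach(System.out::println);
        }
    }
}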

Designing Services

An application server container receives requests and routes them to the appropriate handler function to process the request. The handler is defined by the application service code and implements the business logic required to generate results for the request. As multiple simultaneous requests arrive at a service instance, each is typically allocated an individual thread context to execute the request.2 The issue of thread handling in application servers is one I’ll discuss in more detail later in this chapter.

The sophistication of the routing functionality varies widely by technology platform and language. For example, in Express.js, the container calls a specified function for requests that match an API signature—known as a route path—and HTTP method. The code example below illustrates this with a method that will be called when the client sends a GET request for a specific skier’s profile, as identified by the value of :skierID:

app.get('/skiers/:skierID', function (req, res) {
  // process the GET request
  ProcessRequest(req.params)
})

In Java, the widely used Spring Framework provides an equally sophisticated method routing technique. It leverages a set of annotations that define dependencies and implement dependency injection to simplify the service code. The code snippet below shows an example of annotations usage:

@RestController
public class SkierController {

    @GetMapping(value = "/skiers/{skierID}",
                produces = "application/json")
    public Profile GetSkierProfile(@PathVariable String skierID) {
        // DB query method omitted for brevity
        return GetProfileFromDB(skierID);
    }
}

These annotations provide the following functionality:

@RestController
Identifies the class as a controller that implements an API and automatically serializes the return object into the HttpResponse returned from the API
@GetMapping
Maps the API signature to the specific method, and defines the format of the response body
@PathVariable
Identifies the parameter as a value that originates in the path for a URI that maps to this method

Another Java technology, JEE servlets, also provides annotations, as shown in Example 5-1, but these are simplistic compared to Spring and other higher-level frameworks. The @WebServlet annotation identifies the base pattern for the URI which should cause a particular servlet to be invoked. This is /skiers in our example. The class that implements the API method must extend the HttpServlet abstract class from the javax.servlet.http package and override at least one method that implements an HTTP request handler. The four core HTTP verbs map to methods as follows:

doGet
For HTTP GET requests
doPost
For HTTP POST requests
doPut
For HTTP PUT requests
doDelete
For HTTP DELETE requests

Each method is passed two parameters, namely an HttpServletRequest and HttpServletResponse object. The servlet container creates the HttpServletRequest object, which contains members that represent the components of the incoming HTTP request. This object contains the complete URI path for the call, and it is the servlet’s responsibility to explicitly parse and validate this, and extract path and query parameters if valid. Likewise, the servlet must explicitly set the properties of the response using the HttpServletResponse object.

Servlets therefore require more code from the application service programmer to implement. However, they are likely to provide a more efficient implementation as there is less generated code “plumbing” involved in request processing compared to the more powerful annotation approaches of Spring et al. This is a classic performance versus ease-of-use trade-off. You’ll see lots of these in this book.

Example 5-1. Java servlet example
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.*;

@WebServlet(
    name = "SkiersServlet",
    urlPatterns = "/skiers"
)
public class SkierServlet extends HttpServlet {

  protected void doGet(HttpServletRequest request,
                       HttpServletResponse response) {
    // handles requests to /skiers/{skierID}
    try {
      // extract skierID from the request URI (not shown for brevity)
      String skierID = getSkierIDFromRequest(request);
      if (skierID == null) {
        // request was poorly formatted, return error code
        response.setStatus(HttpServletResponse.SC_BAD_REQUEST);
      } else {
        // read the skier profile from the database
        Profile profile = GetSkierProfile(skierID);
        // add skier profile as JSON to HTTP response and return 200
        response.setContentType("application/json");
        response.getWriter().write(gson.toJson(profile));
        response.setStatus(HttpServletResponse.SC_OK);
      }
    } catch (Exception ex) {
      response.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
    }
  }
}

State Management

State management is a tricky, nuanced topic. The bottom line is that service implementations that need to scale should avoid storing conversational state.

What on earth does that mean? Let’s start by examining the topic of state management with HTTP.

HTTP is known as a stateless protocol. This means each request is executed independently, without any knowledge of the requests that were executed before it from the same client. Statelessness implies that every request needs to be self-contained, with sufficient information provided by the client for the web server to satisfy the request regardless of previous activity from that client.

The picture is a little more complicated than this simple description portrays, however. For example:

  • The underlying socket connection between a client and server is kept open so that the overheads of connection creation are amortized across multiple requests from a client. This is the default behavior for HTTP/1.1 and later versions.

  • HTTP supports cookies, which are known as the HTTP State Management Mechanism. The name gives it away, really!

  • HTTP/2 supports streams, compression, and encryption, all of which require state management.

So, originally HTTP was stateless, but perhaps not anymore? Armed with this confusion, I’ll move on to state management in application services APIs that are built on top of HTTP.

When a user or application connects to a service, it will typically send a series of requests to retrieve and update information. Conversational state represents any information that is retained between requests such that a subsequent request can assume the service has retained knowledge about the previous interactions. I’ll explore what this means in a simple example.

In the skier service API, a user may request their profile by submitting a GET request to the following URI:

GET /skico.com/skiers/768934

They may then use their app to modify their city attribute and send a PUT request to update the resource:

PUT /skico.com/skiers/
{
    "username": "Ian123",
    "email": "i.gorton@somewhere.com",
    "city": "Wenatchee"
}

As this URI does not identify the skier, the service must know the unique identifier of the resource to update, namely 768934. Hence, for this update operation to succeed, the service must have retained conversational state from the previous GET request.

Implementing this approach is relatively straightforward. When the service receives the initial GET request, it creates a session state object that uniquely identifies the client connection. In reality, this is often performed when a user first connects to or logs in to a service. The service can then read the skier profile from the database and utilize the session state object to store conversational state—in our example this would be skierID and likely values associated with the skier profile. When the subsequent PUT request arrives from the client, it uses the session state object to look up the skierID associated with this session and uses that to update the skier’s home city.

Services that maintain conversational state are known as stateful services. Stateful services are attractive from a design perspective as they can minimize the number of times a service retrieves data (state) from the database and reduce the amount of data that is passed between clients and the services.

For services with light request loads, they make eminent sense and are promoted by many frameworks to make services easy to build and deploy. For example, JEE servlets support session management using the HttpSession object, and similar capabilities are offered by the Session object in ASP.NET.
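
As a sketch of what the stateful approach looks like with the servlet HttpSession mechanism (the attribute name and values here are purely illustrative):

import javax.servlet.annotation.WebServlet;
import javax.servlet.http.*;

@WebServlet(urlPatterns = "/skiers")
public class StatefulSkierServlet extends HttpServlet {

    protected void doGet(HttpServletRequest request,
                         HttpServletResponse response) {
        // create a session for this client if one does not already exist
        HttpSession session = request.getSession(true);
        // retain conversational state for use by subsequent requests
        session.setAttribute("skierID", "768934");
        session.setMaxInactiveInterval(30 * 60);  // session timeout, in seconds
        // ... read the profile from the database and build the response ...
    }

    protected void doPut(HttpServletRequest request,
                         HttpServletResponse response) {
        // a later request from the same client recovers the retained state
        HttpSession session = request.getSession(false);
        String skierID = (session == null)
                ? null : (String) session.getAttribute("skierID");
        // ... use skierID to apply the update, or fail if the session expired
    }
}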

As you scale the service implementations, however, the stateful approach becomes problematic. For a single service instance, you have two problems to consider:

  • If you have multiple client sessions all maintaining session state, this will utilize available service memory. The amount of memory utilized will be proportional to the number of clients the service is maintaining state for. If a sudden spike of requests arrives, how can you be certain you will not exhaust available memory and cause the service to fail?

  • You also must be mindful about how long to keep session state available. A client may stop sending requests but not cleanly close their connection to allow the state to be reclaimed. All session management approaches support a default session timeout. If you set this to a short time interval, clients may see their state disappear unexpectedly. If you set the session timeout period to be too long, you may degrade service performance as it runs low on resources.

In contrast, stateless services do not assume that any conversational state from previous calls has been preserved. The service should not maintain any knowledge from earlier requests, so that each request can be processed individually. This requires the client to provide all the necessary information for the service to process the request and provide a response. This is in fact how the skier API is specified in Table 5-1, namely:

PUT /skico.com/skiers/768934
{
    "username": "Ian123",
    "email": "i.gorton@somewhere.com",
    "city": "Wenatchee"
}

A sequence diagram illustrating this stateless design is shown in Figure 5-2.

Figure 5-2. Stateless API example

Any scalable service will need stateless APIs. The reason why will become clear when I explain horizontal scaling later in this chapter. For now, the most important design implication is that for a service that needs to retain state pertaining to client sessions—the classic shopping cart example—this state must be stored externally to the service. This invariably means an external data store.

Application Servers

Application servers are the heart of a scalable application, hosting the business services that compose an application. Their basic role is to accept requests from clients, apply application logic to the requests, and reply to the client with the request results. Clients may be external or internal, as in other services in the application that need to use the functionality of a specific service.

The technological landscape of application servers is broad and complex, depending on the language you want to use and the specific capabilities that each offers. In Java, the Java Enterprise Edition (JEE) defines a comprehensive, feature-rich, standards-based platform for application servers, with multiple different vendor and open source implementations.

In other languages, the Express.js server supports Node, Flask supports Python, and in Go a service can be created by incorporating the net/http package. These implementations are much more minimal and lightweight than JEE and are typically classified as web application frameworks. In Java, the Apache Tomcat server is a somewhat equivalent technology. Tomcat is an open source implementation of a subset of the JEE platform, namely the Java servlet, JavaServer Pages (JSP), Java Expression Language (EL), and Java WebSocket technologies.

Figure 5-3 depicts a simplified view of the anatomy of Tomcat. Tomcat implements a servlet container, which is an execution environment for application-defined servlets. Servlets are dynamically loaded into this container, which provides life cycle management and a multithreaded runtime environment.

Figure 5-3. Anatomy of a web application server

Requests arrive at the IP address of the server, which is listening for traffic on specific ports. For example, by default Tomcat listens on port 8080 for HTTP requests and 8443 for HTTPS requests. Incoming requests are processed by one or more listener threads. These create a TCP/IP socket connection between the client and server. If network requests arrive at a frequency that cannot be processed by the TCP listener, pending requests are queued up in the Sockets Backlog. The size of the backlog is operating system dependent. In most Linux versions the default is 100.

Once a connection is established, the TCP requests are marshalled by, in this example, an HTTP Connector which generates the HTTP request (HttpServletRequest object as in Figure 5-2) that the servlet can process. The HTTP request is then dispatched to an application container thread to process.

Application container threads are managed in a thread pool, essentially a Java Executor, which by default in Tomcat is a minimum size of 25 threads and a maximum of 200. If there are no available threads to handle a request, the container maintains them in a queue of runnable tasks and dispatches these as soon as a thread becomes available. This queue by default is size Integer.MAX_VALUE—that is, essentially unbounded.3 By default, if a thread remains idle for 60 seconds, it is killed to free up resources in the Java virtual machine.
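
In plain java.util.concurrent terms, the pool just described corresponds roughly to the configuration below. One caveat: a standard ThreadPoolExecutor with an unbounded queue never grows beyond its core size, whereas Tomcat uses a customized executor and queue so the pool can expand toward its maximum before queuing. The sketch is therefore an approximation of the described defaults, not Tomcat's actual implementation:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// approximation of the defaults described above: minimum of 25 threads,
// maximum of 200, 60-second idle timeout, and an essentially unbounded
// queue of waiting tasks
ThreadPoolExecutor containerPool = new ThreadPoolExecutor(
        25,                               // core pool size
        200,                              // maximum pool size
        60L, TimeUnit.SECONDS,            // idle thread keep-alive
        new LinkedBlockingQueue<>());     // Integer.MAX_VALUE capacity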

For each request, the method that corresponds with the HTTP request is invoked in a thread. The servlet method processes the HTTP request headers, executes the business logic, and constructs a response that is marshalled by the container back into a TCP/IP packet and sent over the network to the client.

In processing the business logic, servlets often need to query an external database. This requires each thread executing the servlet methods to obtain a database connection and execute database queries. In many databases, especially relational ones, connections are limited resources as they consume memory and system resources in both the client and database server. For this reason, a fixed-size database connection pool is typically utilized. The pool hands out open connections to requesting threads on demand.

When a servlet wishes to submit a query to the database, it requests a connection from the pool. If one is available, access to the connection is granted to the servlet until it indicates it has completed its work. At that stage the connection is returned to the pool and made available for another servlet to utilize. As the container thread pool is typically larger than the database connection pool, a servlet may request a connection when none are available. To handle this, the connection pool maintains a request queue and hands out open connections on a FIFO basis, and threads in the queue are blocked until there is availability or a timeout occurs.
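
The borrow-use-return cycle looks like the following sketch, written against the standard javax.sql.DataSource interface; the pool implementation behind it, and the table and query, are hypothetical:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public String getSkierCity(DataSource pool, String skierID) throws Exception {
    // getConnection() borrows from the pool, blocking (or timing out)
    // if all connections are currently in use
    try (Connection conn = pool.getConnection();
         PreparedStatement stmt = conn.prepareStatement(
                 "SELECT city FROM skiers WHERE skier_id = ?")) {
        stmt.setString(1, skierID);
        try (ResultSet rs = stmt.executeQuery()) {
            return rs.next() ? rs.getString("city") : null;
        }
    }  // try-with-resources closes conn, returning it to the pool
}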

An application server framework such as Tomcat is hence highly configurable for different workloads. For example, the size of the thread and database connection pools can be specified in configuration files that are read at startup.

The complete Tomcat container environment runs within a single JVM, and hence processing capacity is limited by the number of vCPUs available and the amount of memory allocated as heap size. Each allocated thread consumes memory, and the various queues in the request-processing pipeline consume resources while requests are waiting. This means that request response time will be governed by both the request-processing time in the servlet business logic as well as the time spent waiting in queues for threads and connections to become available.

In a heavily loaded server with many threads allocated, context switching may start to degrade performance, and available memory may become limited. If performance degrades, queues grow as requests wait for resources. This consumes more memory. If more requests are received than can be queued up and processed by the server, then new TCP/IP connections will be refused, and clients will see errors. Eventually, an overloaded server may run out of resources and start throwing exceptions and crash.

Consequently, time spent tuning configuration parameters to efficiently handle anticipated loads is rarely wasted. Systems tend to degrade in performance well before they reach 100% utilization. Once any resource—CPU utilization, memory usage, network, disk accesses, etc.—gets close to full utilization, systems exhibit less predictable performance. This is because more time is spent on time-wasting tasks such as thread context switching and memory garbage collecting. This inevitably affects latencies and throughput. Thus, having a utilization target is essential. Exactly what these thresholds should be is extremely application dependent.

Monitoring tools available with web application frameworks enable engineers to gather a range of important metrics, including latencies, active requests, queue sizes, and so on. These are invaluable for carrying out data-driven experiments that lead to performance optimization.

Java-based application frameworks such as Tomcat support the Java Management Extensions (JMX) framework, which is a standard part of the Java Standard Edition platform. JMX enables frameworks to expose monitoring information based on the capabilities of MBeans, which represent a resource of interest (e.g., thread pool, database connections usage). This enables an ecosystem of tools to offer capabilities for monitoring JMX-supported platforms. These range from JConsole, which is available in the Java Development Kit (JDK) by default, to powerful open source technologies such as JavaMelody and many expensive commercial offerings.
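
As a small taste of the kind of data JMX exposes, the platform MXBeans can be queried directly in-process; monitoring tools such as JConsole read the same MBeans remotely:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class JmxProbe {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        // basic health metrics of the kind surfaced by JMX-based tools
        System.out.println("Live threads: " + threads.getThreadCount());
        System.out.println("Peak threads: " + threads.getPeakThreadCount());
        System.out.println("Heap used (bytes): "
                + memory.getHeapMemoryUsage().getUsed());
    }
}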

Horizontal Scaling

A core principle of scaling a system is being able to easily add new processing capacity to handle increased load. For most systems, a simple and effective approach is deploying multiple instances of stateless server resources and using a load balancer to distribute the requests across these instances. This is known as horizontal scaling and is illustrated in Figure 5-4. Stateless service replicas and a load balancer are both necessary for horizontal scaling.

Figure 5-4. Simple load balancing example

Service replicas are deployed on their own (virtual) hardware. If we have two replicas, we double our processing capacity. If we have ten replicas, we have potentially 10x capacity. This enables our system to handle increased loads. The aim of horizontal scaling is to create a system-processing capacity that is the sum of the total resources available.

The servers need to be stateless, so that any request can be sent to any service replica to handle. This decision is made by the load balancer, which can use various policies to distribute requests. If the load balancer can keep each service replica equally busy, then we are effectively using the processing capacity provided by the service replicas.

If our services are stateful, the load balancer needs to always route requests from the same client to the same service replica. As client sessions have indeterminate durations, this can lead to some replicas being much more heavily loaded than others. This creates an imbalance and is not effective in using the available capacity evenly across replicas. I’ll return to this issue in more detail in the next section on load balancing.

Note

Technologies like Spring Session and plugins to Tomcat’s clustering platform allow session state to be externalized in general purpose distributed caches like Redis and memcached. This effectively makes our services stateless. Load balancers can distribute requests across all replicated services without concern for state management. I’ll cover the topic of distributed caches in Chapter 6.

Horizontal scaling also increases availability. With one service instance, if it fails, the service is unavailable. This is known as a single point of failure (SPoF)—a bad thing, and something to avoid in any scalable distributed system. Multiple replicas increase availability. If one replica fails, requests can be directed to any replica—remember, they are stateless. The system will have reduced capacity until the failed server is replaced, but it will still be available. The ability to scale is crucial, but if a system is unavailable, then the most scalable system ever built is still somewhat ineffective!

负载均衡

Load Balancing

负载均衡旨在有效利用一组服务的容量,以优化每个请求的响应时间。这是通过在可用服务之间分配请求、充分利用服务的整体容量来实现的。目的是避免某些服务过载而其他服务利用率不足。

Load balancing aims to effectively utilize the capacity of a collection of services to optimize the response time for each request. This is achieved by distributing requests across the available services to ideally utilize the collective service capacity. The objective is to avoid overloading some services while underutilizing others.

客户端将请求发送到负载均衡器的 IP 地址,负载均衡器将请求重定向到目标服务,并将结果转发回客户端。这意味着客户端永远不会直接联系目标服务,这也有利于安全,因为服务可以位于安全边界后面并且不会暴露在互联网上。

Clients send requests to the IP address of the load balancer, which redirects requests to target services, and relays the results back to the client. This means clients never contact the target services directly, which is also beneficial for security as the services can live behind a security perimeter and not be exposed to the internet.

负载均衡器可以工作在网络级别或应用程序级别,通常分别称为第 4 层和第 7 层负载均衡器。这些名称指的是开放系统互连 (OSI) 参考模型中位于第 4 层的网络传输层和位于第 7 层的应用层。OSI 模型用七个抽象层定义网络功能,每一层都定义了数据打包和传输方式的标准。

Load balancers may act at the network level or the application level. These are often called layer 4 and layer 7 load balancers, respectively. The names refer to the network transport layer at layer 4 in the Open Systems Interconnection (OSI) reference model, and the application layer at layer 7. The OSI model defines network functions in seven abstract layers. Each layer defines standards for how data is packaged and transported.

网络级负载均衡器在网络连接级别分发请求,对单个 TCP 或 UDP 数据包进行操作。路由决策是根据客户端 IP 地址做出的。一旦选择了目标服务,负载均衡器就会使用一种称为网络地址转换 (NAT) 的技术。这会将客户端请求数据包中的目标 IP 地址从负载均衡器的地址更改为所选目标的地址。当从目标接收到响应时,负载均衡器将数据包标头中记录的源地址从目标的 IP 地址更改为自己的 IP 地址。网络负载均衡器相对简单,因为它们在单个数据包级别上运行。这意味着它们的速度非常快,因为除了选择目标服务和执行 NAT 功能之外,它们提供的功能很少。

Network-level load balancers distribute requests at the network connection level, operating on individual TCP or UDP packets. Routing decisions are made on the basis of client IP addresses. Once a target service is chosen, the load balancer uses a technique called network address translation (NAT). This changes the destination IP address in the client request packet from that of the load balancer to that of the chosen target. When a response is received from the target, the load balancer changes the source address recorded in the packet header from the target’s IP address to its own. Network load balancers are relatively simple as they operate on the individual packet level. This means they are extremely fast, as they provide few features beyond choosing a target service and performing NAT functionality.

相比之下,应用程序级负载均衡器重新组装完整的 HTTP 请求,并根据 HTTP 标头的值和消息的实际内容做出路由决策。例如,负载均衡器可以配置为将所有POST请求发送到可用服务的子集,或根据 URI 中的查询字符串分发请求。应用程序负载均衡器是复杂的反向代理。它们提供的更丰富的功能意味着它们比网络负载均衡器稍慢,但它们提供的强大功能可以用来弥补所产生的开销。

In contrast, application-level load balancers reassemble the complete HTTP request and base their routing decisions on the values of the HTTP headers and on the actual contents of the message. For example, a load balancer can be configured to send all POST requests to a subset of available services, or distribute requests based on a query string in the URI. Application load balancers are sophisticated reverse proxies. The richer capabilities they offer mean they are slightly slower than network load balancers, but the powerful features they offer can be utilized to more than make up for the overheads incurred.

为了让您了解网络层和应用程序层负载均衡器之间的原始性能差异,图 5-5 在一个简单的应用场景中对两者进行了比较。正在测试的负载均衡技术是 AWS 的应用程序弹性负载均衡器和网络弹性负载均衡器 (ELB)。每个负载均衡器将请求路由到 4 个副本之一,由副本执行业务逻辑并通过负载均衡器将结果返回给客户端。客户端负载从轻负载的 32 个并发客户端到中等负载的 256 个并发客户端不等。每个客户端发送一系列请求,在接收上一个请求的结果和向服务器发送下一个请求之间没有延迟。

To give you some idea of the raw performance difference between network- and application-layer load balancers, Figure 5-5 compares the two in a simple application scenario. The load balancing technology under test is the AWS Application and Network Elastic Load Balancers (ELBs). Each load balancer routes requests to one of 4 replicas. These execute the business logic and return results to the clients via the load balancer. Client load varies from a lightly loaded 32 concurrent clients to a moderate 256 concurrent clients. Each client sends a sequence of requests with no delay between receiving the results from one request and sending the next request to the server.

从图 5-5中可以看出,对于 32、64 和 128 个客户端测试,网络负载均衡器的性能平均提高了约 20%。这验证了不太复杂的网络负载均衡器预期的更高性能。对于 256 个客户端,两个负载均衡器的性能基本相同。这是因为超过了4个副本的容量,系统出现瓶颈。在此阶段,负载均衡器对系统性能没有影响。您需要向负载均衡组添加更多副本以增加系统容量,从而提高吞吐量。

You can see from Figure 5-5 that the network load balancer delivers on average around 20% higher performance for the 32, 64, and 128 client tests. This validates the expected higher performance from the less sophisticated network load balancer. For 256 clients, the performance of the two load balancers is essentially the same. This is because the capacity of the 4 replicas is exceeded and the system has a bottleneck. At this stage the load balancers make no difference to the system performance. You need to add more replicas to the load balancing group to increase system capacity, and hence throughput.

比较负载均衡器性能
图 5-5。比较负载均衡器性能4

一般来说,负载均衡器具有以下功能,将在以下部分中进行解释:

In general, a load balancer has the following features that will be explained in the following sections:

  • 负载分配策略

  • Load distribution policies

  • 健康监测

  • Health monitoring

  • 弹性

  • Elasticity

  • 会话关联性

  • Session affinity

负载分配策略

Load Distribution Policies

负载分配策略规定负载均衡器如何选择目标服务来处理请求。任何称职的负载均衡器都会提供多种负载分配策略——HAProxy 提供 10 种。以下是所有负载均衡器中最常支持的四种:

Load distribution policies dictate how the load balancer chooses a target service to process a request. Any load balancer worth its salt will offer several load distribution policies—HAProxy offers 10. The following are four of the most commonly supported across all load balancers:

轮询
Round robin
负载均衡器以轮询方式将请求分发到可用的服务器。
The load balancer distributes requests to available servers in a round-robin fashion.
最少连接数
Least connections
负载均衡器将新请求分发到打开连接数最少的服务器。
The load balancer distributes new requests to the server with the least open connections.
HTTP 标头字段
HTTP header field
负载均衡器根据特定 HTTP 标头字段的内容来引导请求。例如,所有带有标头字段 X-Client-Location:US,Seattle 的请求都可以路由到一组特定的服务器。
The load balancer directs requests based on the contents of a specific HTTP header field. For example, all requests with the header field X-Client-Location:US,Seattle could be routed to a specific set of servers.
HTTP操作
HTTP operation
负载均衡器根据请求中的 HTTP 动词来定向请求。
The load balancer directs requests based on the HTTP verb in the request.

负载均衡器还允许为服务分配权重。例如,负载均衡池中的标准服务实例可能有 4 个 vCPU,每个实例分配的权重为 1。如果添加一个运行在 8 个 vCPU 上的服务副本,可以为其分配权重 2,这样负载均衡器就会向它发送两倍数量的请求。

Load balancers will also allow services to be allocated weights. For example, standard service instances in the load balancing pool may have 4 vCPUs and each is allocated a weight of 1. If a service replica running on 8 vCPUs is added, it can be assigned a weight of 2 so the load balancer will send twice as many requests its way.
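
A hypothetical sketch of this weighted scheme shows the core idea: a replica with weight 2 simply appears twice in the rotation, so it receives twice as many requests:

import java.util.ArrayList;
import java.util.List;

public class WeightedRoundRobin {
  private final List<String> rotation = new ArrayList<>();
  private int next = 0;

  // Add a server once per unit of weight, so an 8 vCPU replica with
  // weight 2 appears twice and receives twice as many requests
  public void addServer(String server, int weight) {
    for (int i = 0; i < weight; i++) {
      rotation.add(server);
    }
  }

  // Choose the next target in strict round-robin order
  public synchronized String choose() {
    String target = rotation.get(next);
    next = (next + 1) % rotation.size();
    return target;
  }
}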

健康监测

Health Monitoring

负载均衡器会定期发送 ping 并尝试建立连接,以测试负载均衡池中每个服务的运行状况。这些测试称为健康检查。如果服务变得无响应或连接尝试失败,它将从负载均衡池中删除,并且不会再向该主机发送任何请求。如果与服务的连接只是遇到暂时性故障,则负载均衡器会在服务恢复可用且正常运行后将其重新纳入池中。但如果该服务确实已经失败,它将从负载均衡器的目标池中删除。

A load balancer will periodically send pings and attempt connections to test the health of each service in the load balancing pool. These tests are called health checks. If a service becomes unresponsive or fails connection attempts, it will be removed from the load balancing pool and no requests will be sent to that host. If the connection to the service has experienced a transient failure, the load balancer will reincorporate the service once it becomes available and healthy. If, however, it has failed, the service will be removed from the load balancer target pool.
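
The following sketch approximates this behavior (the /health endpoint, timeouts, and pool handling are assumptions for illustration, not any specific product's implementation):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class HealthChecker {
  private final Set<String> pool = ConcurrentHashMap.newKeySet();

  // Called periodically for every known server
  public void check(String server) {
    try {
      HttpURLConnection conn = (HttpURLConnection)
          new URL("http://" + server + "/health").openConnection();
      conn.setConnectTimeout(2000); // fail fast on unresponsive hosts
      conn.setReadTimeout(2000);
      if (conn.getResponseCode() == 200) {
        pool.add(server);    // healthy: (re)incorporate into the pool
      } else {
        pool.remove(server); // unhealthy: stop routing requests to it
      }
    } catch (Exception e) {
      pool.remove(server);   // unreachable: treat as failed
    }
  }
}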

弹性

Elasticity

请求负载的峰值可能会导致负载均衡器可用的服务容量变得饱和,导致响应时间更长,最终导致请求和连接失败。弹性是应用程序动态提供新服务容量以处理请求增加的能力。随着负载的增加,新的副本将启动,负载均衡器会将请求定向到这些副本。随着负载减少,负载均衡器会停止不再需要的服务。

Spikes in request loads can cause the service capacity available to a load balancer to become saturated, leading to longer response times and eventually request and connection failures. Elasticity is the capability of an application to dynamically provision new service capacity to handle an increase in requests. As load increases, new replicas are started and the load balancer directs requests to these. As load decreases, the load balancer stops services that are no longer needed.

弹性要求负载均衡器与应用程序监控紧密集成,以便可以定义扩展策略来确定何时扩展和缩减。例如,策略可以指定,当所有实例的平均服务 CPU 利用率超过 70% 时,应增加服务的容量,而当平均 CPU 利用率低于 40% 时,应减少服务的容量。通常可以使用通过监控系统可用的任何指标来定义扩展策略。

Elasticity requires a load balancer to be tightly integrated with application monitoring, so that scaling policies can be defined to determine when to scale up and down. Policies may specify, for example, that capacity for a service should be increased when the average service CPU utilization across all instances is over 70%, and decreased when average CPU utilization is below 40%. Scaling policies can typically be defined using any metrics that are available through the monitoring system.
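
In code, a policy evaluation of this kind reduces to a simple threshold check; the sketch below uses the 70%/40% thresholds from the example above (the metric source and the step size of one replica at a time are assumptions):

public class ScalingPolicy {
  private static final double SCALE_OUT_THRESHOLD = 70.0;
  private static final double SCALE_IN_THRESHOLD = 40.0;

  // avgCpu: average CPU utilization across replicas, supplied by monitoring
  public int desiredReplicas(double avgCpu, int current, int min, int max) {
    if (avgCpu > SCALE_OUT_THRESHOLD && current < max) {
      return current + 1; // provision one more replica
    }
    if (avgCpu < SCALE_IN_THRESHOLD && current > min) {
      return current - 1; // stop a replica that is no longer needed
    }
    return current;       // utilization within bounds: no change
  }
}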

弹性负载均衡的一个示例是 AWS Auto Scaling 组。Auto Scaling 组是可供负载均衡器使用的一组服务实例,并定义了最小和最大规模。负载均衡器将确保该组始终有最小数量的可用服务,并且该组永远不会超过最大数量。该方案如图 5-6 所示。

An example of elastic load balancing is the AWS Auto Scaling groups. An Auto Scaling group is a collection of service instances available to a load balancer that is defined with a minimum and maximum size. The load balancer will ensure the group always has the minimum number of services available, and the group will never exceed the maximum number. This scheme is illustrated in Figure 5-6.

弹性负载均衡
图 5-6。弹性负载均衡

通常,有两种方法可以控制组中的副本数量。第一种基于时间表,适用于请求负载的增减可以预测的情况。例如,您可能运营一个在线娱乐指南,在每周四下午 6 点发布一组主要城市的周末活动。这会带来持续到周日中午的较高负载。Auto Scaling 组可以轻松配置为在周四下午 6 点提供新服务,并在周日中午将组规模缩减到最小。

Typically, there are two ways to control the number of replicas in a group. The first is based on a schedule, when the request load increases and decreases are predictable. For example, you may have an online entertainment guide and publish the weekend events for a set of major cities at 6 p.m. on Thursday. This generates a higher load until Sunday at noon. An Auto Scaling group could easily be configured to provision new services at 6 p.m. Thursday and reduce the group size to the minimum at noon Sunday.

如果增加的负载峰值不可预测,则可以通过基于应用程序指标(例如平均 CPU 和内存使用情况,以及队列中的消息数量)定义的扩展策略来动态控制弹性。如果超过策略的上限阈值,负载均衡器将启动一个或多个新的服务实例,直到该指标降到阈值以下。实例需要时间来启动(通常是一分钟或更长),因此可以定义一个预热期,在此期间新实例不被视为已对组的容量做出贡献。当观察到的指标值低于扩展策略中定义的下限阈值时,就会开始缩容,实例将自动停止并从池中删除。

If increased load spikes are not predictable, elasticity can be controlled dynamically by defined scaling policies based on application metrics such as average CPU and memory usage and number of messages in a queue. If the upper threshold of the policy is exceeded, the load balancer will start one or more new service instances until performance drops below the metric threshold. Instances need time to start—often a minute or more—and hence a warm-up period can be defined until the new instance is considered to be contributing to the group’s capacity. When the observed metric value drops below the lower threshold defined in the scaling policy, scale in or scale down commences and instances will be automatically stopped and removed from the pool.

弹性是一项关键功能,它允许服务随着需求的增长而动态扩展。对于工作负载波动的高度可扩展系统,它几乎是以最低成本提供必要容量的必备能力。

Elasticity is a key feature that allows services to scale dynamically as demand grows. For highly scalable systems with fluctuating workloads, it is pretty much a mandatory capability for providing the necessary capacity at minimum costs.

会话亲和性

Session Affinity

会话亲和性(即粘性会话)是一项面向有状态服务的负载均衡器功能。通过粘性会话,负载均衡器将来自同一客户端的所有请求发送到同一服务实例。这使得服务能够在内存中维护每个特定客户端会话的状态。

Session affinity, or sticky sessions, are a load balancer feature for stateful services. With sticky sessions, the load balancer sends all requests from the same client to the same service instance. This enables the service to maintain in-memory state about each specific client session.

有多种方法可以实现粘性会话。例如,HAProxy 提供了一套全面的功能,可以在服务添加、删除和故障时将客户端请求维持在同一服务上。AWS Elastic Load Balancing (ELB) 会生成一个 HTTP cookie,用于标识客户端会话所关联的服务副本。该 cookie 返回给客户端,客户端必须在后续请求中携带它,以确保维持会话关联性。

There are various ways to implement sticky sessions. For example, HAProxy provides a comprehensive set of capabilities to maintain client requests on the same service in the face of service additions, removals, and failures. AWS Elastic Load Balancing (ELB) generates an HTTP cookie that identifies the service replica a client’s session is associated with. This cookie is returned to the client, which must send it in subsequent requests to ensure session affinity is maintained.

对于高度可扩展的系统来说,粘性会话可能会出现问题。它们会导致负载不平衡问题,随着时间的推移,客户端在服务之间的分布不均匀。如图 5-7所示,其中两个客户端连接到一个服务,而另一个服务保持空闲。

Sticky sessions can be problematic for highly scalable systems. They lead to a load imbalance problem, in which, over time, clients are not evenly distributed across services. This is illustrated in Figure 5-7, where two clients are connected to one service while another service remains idle.

会话粘性导致负载不平衡
图 5-7。会话粘性导致负载不平衡

由于客户端会话持续的时间不同,因此会出现负载不平衡。即使会话最初均匀分布,有些会话也会很快终止,而另一些会话则会持续存在。在轻负载系统中,这往往不是问题。然而,在一个不断创建和销毁数百万个会话的系统中,负载不平衡是不可避免的。这将导致一些服务副本未得到充分利用,而另一些服务副本则不堪重负,并可能因资源耗尽而失败。为了帮助缓解负载不平衡,负载均衡器通常提供诸如将新会话发送到连接数最少或响应时间最快的实例等策略。这些有助于引导新会话远离负载过重的服务。

Load imbalance occurs because client sessions last for varying amounts of time. Even if sessions are evenly distributed initially, some will terminate quickly while others will persist. In a lightly loaded system, this tends to not be an issue. However, in a system with millions of sessions being created and destroyed constantly, load imbalance is inevitable. This will lead to some service replicas being underutilized, while others are overwhelmed and may potentially fail due to resource exhaustion. To help alleviate load imbalance, load balancers usually provide policies such as sending new sessions to instances with the least connections or fastest response times. These help direct new sessions away from heavily loaded services.

有状态服务还有其他缺点。当服务不可避免地失败时,连接到该服务器的客户端如何恢复正在管理的状态?如果服务实例由于高负载而变慢,客户端如何响应?一般来说,有状态服务器会产生一些问题,在大型系统中这些问题可能难以设计和管理。

Stateful services have other downsides. When a service inevitably fails, how do the clients connected to that server recover the state that was being managed? If a service instance becomes slow due to high load, how do clients respond? In general, stateful servers create problems that in large scale systems can be difficult to design around and manage.

无状态服务没有这些缺点。如果失败,客户端会收到异常并重试,并将其请求路由到另一个实时服务副本。如果服务由于短暂的网络中断而速度缓慢,负载均衡器会将其从服务组中删除,直到它通过健康检查或失败。所有应用程序状态要么外部化,要么由客户端在每个请求中提供,因此负载均衡器可以轻松处理服务故障。

Stateless services have none of these downsides. If one fails, clients get an exception and retry, with their request routed to another live service replica. If a service is slow due to a transient network outage, the load balancer takes it out of the service group until it passes health checks or fails. All application state is either externalized or provided by the client in each request, so service failures can be handled easily by the load balancer.

无状态服务增强可扩展性,简化故障场景,减轻服务管理的负担。对于可扩展应用程序来说,这些优点远远超过缺点,因此被 Netflix 等大多数主要的大型互联网站点所采用。

Stateless services enhance scalability, simplify failure scenarios, and ease the burden of service management. For scalable applications, these advantages far outweigh the disadvantages, and hence their adoption in most major, large-scale internet sites such as Netflix.

最后请记住,通过负载均衡来扩展一组服务,很可能会压垮这些负载均衡服务所依赖的下游服务或数据库。就像高速公路一样,如果 50 英里长的公路尽头是一组红绿灯和一条单车道公路,那么把它拓宽到 8 条车道只会造成更严重的交通混乱。我相信我们都经历过这种情况。我将在第 9 章中讨论这些问题。

Finally, bear in mind that scaling one collection of services through load balancing may well overwhelm downstream services or databases that the load balanced services depend on. Just like with highways, adding eight traffic lanes for 50 miles will just cause bigger traffic chaos if the highway ends at a set of traffic lights with a one-lane road on the other side. We’ve all been there, I’m sure. I’ll address these issues in Chapter 9.

总结和延伸阅读

Summary and Further Reading

服务是可扩展软件系统的核心。它们以 API 的形式定义契约,向客户端说明自身的功能。服务在应用服务器容器环境中执行,该环境托管服务代码,并将传入的 API 请求路由到适当的处理逻辑。应用服务器高度依赖于编程语言,但通常提供多线程编程模型,允许服务同时处理许多请求。如果容器线程池中的线程全部被占用,应用服务器会将请求排队,直到有线程可用。

Services are the heart of a scalable software system. They define the contract as an API that specifies their capabilities to clients. Services execute in an application server container environment that hosts the service code and routes incoming API requests to the appropriate processing logic. Application servers are highly programming language dependent, but in general provide a multithreaded programming model that allows services to process many requests simultaneously. If the threads in the container thread pool are all utilized, the application server queues up requests until a thread becomes available.

随着服务上请求负载的增长,我们可以使用负载均衡器水平扩展它,以在多个实例之间分配请求。该架构还提供高可用性,因为多服务配置意味着应用程序可以容忍单个实例的故障。服务实例由负载均衡器作为池进行管理,负载均衡器利用负载分配策略为每个请求选择目标服务副本。无状态服务允许负载均衡器简单地向响应目标重新发送请求,从而轻松扩展并简化故障场景。尽管大多数负载均衡器将使用称为粘性会话的功能来支持有状态服务,但有状态服务使负载均衡和处理故障更加复杂。因此,不建议将它们用于高度可扩展的服务。

As request loads grow on a service, we can scale it out horizontally using a load balancer to distribute requests across multiple instances. This architecture also provides high availability as the multiple-service configuration means the application can tolerate failures of individual instances. The service instances are managed as a pool by the load balancer, which utilizes a load distribution policy to choose a target service replica for each request. Stateless services scale easily and simplify failure scenarios by allowing the load balancer to simply resend requests to responsive targets. Although most load balancers will support stateful services using a feature called sticky sessions, stateful services make load balancing and handling failures more complex. Hence, they are not recommended for highly scalable services.

注意

API 设计是一个非常复杂且充满争议的话题。Thoughtworks 博客上提供了有关基本 API 设计和资源建模的精彩概述。

API design is a topic of great complexity and debate. An excellent overview of basic API design and resource modeling is available on the Thoughtworks blog.

Java Enterprise Edition (JEE) 是一种成熟且广泛部署的服务器端技术。它具有广泛的抽象,可用于构建丰富而强大的服务。Oracle教程是了解该平台的绝佳起点。

The Java Enterprise Edition (JEE) is an established and widely deployed server-side technology. It has a wide range of abstractions for building rich and powerful services. The Oracle tutorial is an excellent starting point for appreciating this platform.

有关负载均衡器的许多知识和信息都隐藏在技术供应商提供的文档中。您选择负载均衡器,然后深入阅读手册。Tony Bourke 所著的《Server Load Balancing》 (O'Reilly,2001 年)是一个很好的资源,可以让您对负载平衡的整个领域有一个出色、广泛的了解。

Much of the knowledge and information about load balancers is buried in the documentation provided by the technology suppliers. You choose your load balancer and then dive into the manuals. For an excellent, broad perspective on the complete field of load balancing, Server Load Balancing by Tony Bourke (O’Reilly, 2001) is a good resource.

1 Roy T. Fielding, “架构风格和基于网络的软件架构的设计”。论文。加州大学欧文分校,2000 年。

1 Roy T. Fielding, “Architectural Styles and the Design of Network-Based Software Architectures”. Dissertation. University of California, Irvine, 2000.

2 Node.js 是一个值得注意的例外,因为它是单线程的。但是,它针对阻塞 I/O 采用异步编程模型,支持同时处理许多请求。

2 Node.js is a notable exception here as it is single threaded. However, it employs an asynchronous programming model for blocking I/O that supports handling many simultaneous requests.

3有关默认 Tomcat 执行器配置设置,请参阅Apache Tomcat 9 配置参考。

3 See Apache Tomcat 9 Configuration Reference for default Tomcat Executor configuration settings.

4 西雅图东北大学计算机科学硕士项目的 Ruijie Xiao 的实验结果。

4 Experimental results by Ruijie Xiao, from Northeastern University’s MS program in computer science in Seattle.

第 6 章 分布式缓存

Chapter 6. Distributed Caching

缓存存在于应用程序的许多地方。运行应用程序的 CPU 具有快速、多级硬件缓存,可减少相对较慢的主内存访问。数据库引擎可以利用主内存来缓存数据存储的内容,以便在许多情况下查询不必接触相对较慢的磁盘。

Caches exist in many places in an application. The CPUs that run your applications have fast, multilevel hardware caches to reduce relatively slow main memory accesses. Database engines can make use of main memory to cache the contents of the data store so that in many cases queries do not have to touch relatively slow disks.

分布式缓存是可扩展系统的重要组成部分。缓存使昂贵的查询和计算的结果可供后续请求以低成本重用。由于不必为每个请求重建缓存的结果,系统的容量得到了增加,并且可以扩展以处理更大的工作负载。

Distributed caching is an essential ingredient of a scalable system. Caching makes the results of expensive queries and computations available for reuse by subsequent requests at low cost. By not having to reconstruct the cached results for every request, the capacity of the system is increased, and it can scale to handle greater workloads.

我将在本章中介绍两种类型的缓存。应用程序缓存需要在业务逻辑中结合使用分布式缓存来缓存和访问预计算的结果。Web 缓存则利用 HTTP 协议中内置的机制,在互联网提供的基础设施内缓存结果。如果使用得当,两者都能保护您的服务和数据库免受繁重的读取流量负载。

I’ll cover two flavors of caching in this chapter. Application caching requires business logic that incorporates the caching and access of precomputed results using distributed caches. Web caching exploits mechanisms built into the HTTP protocol to enable caching of results within the infrastructure provided by the internet. When used effectively, both will protect your services and databases from heavy read traffic loads.

应用程序缓存

Application Caching

应用程序缓存旨在通过将查询和计算的结果存储在内存中来提高请求响应能力,以便后续请求可以为它们提供服务。例如,考虑一个在线报纸网站,读者可以在其中发表评论。文章一旦发布,就很少更改(如果有的话)。新评论往往会在文章发表后很快发布,但随着文章时间的推移,频率会迅速下降。因此,一篇文章可以在第一次访问时缓存,并由所有后续请求重用,直到该文章被更新、发布新评论或没有人想再阅读它。

Application caching is designed to improve request responsiveness by storing the results of queries and computations in memory so they can be subsequently served by later requests. For example, think of an online newspaper site where readers can leave comments. Once posted, articles change infrequently, if ever. New comments tend to get posted soon after an article is published, but the frequency drops quickly with the age of the article. Hence an article can be cached on first access and reused by all subsequent requests until the article is updated, new comments are posted, or no one wants to read it anymore.

一般来说,缓存可以减轻数据库的大量读取流量,因为许多查询可以直接从缓存中提供服务。它还降低了构建成本高昂的对象的计算成本,例如那些需要跨越多个不同数据库的查询的对象。最终效果是减少我们服务和数据库的计算负载,并为更多请求创造空间或容量。

In general, caching relieves databases of heavy read traffic, as many queries can be served directly from the cache. It also reduces computation costs for objects that are expensive to construct, for example, those needing queries that span several different databases. The net effect is to reduce the computational load on our services and databases and create headroom, or capacity for more requests.

缓存需要额外的资源,因此需要额外的成本来存储缓存的结果。然而,与升级数据库和服务节点以应对更高的请求负载相比,精心设计的缓存方案成本较低。作为缓存价值的体现,Twitter 大约 3% 的基础设施专门用于应用程序级缓存。在 Twitter 规模上,运行数百个集群,这相当于大量的基础设施!

Caching requires additional resources, and hence cost, to store cached results. However, well-designed caching schemes are low cost compared to upgrading database and service nodes to cope with higher request loads. As an indication of the value of caches, approximately 3% of infrastructure at Twitter is dedicated to application-level caches. At Twitter scale, operating hundreds of clusters, that is a lot of infrastructure!

应用程序级缓存利用专用的分布式缓存引擎。该领域的两种主要技术是 memcached 和 Redis。两者本质上都是分布式内存哈希表,专为存储任意数据(字符串、对象)而设计,这些数据代表数据库查询或下游服务 API 调用的结果。缓存的常见用例是存储用户会话数据、动态网页和数据库查询结果。缓存对于应用程序服务来说就像一个单一的存储,使用对象键上的哈希函数将对象分配到各个缓存服务器。

Application-level caching exploits dedicated distributed cache engines. The two predominant technologies in this area are memcached and Redis. Both are essentially distributed in-memory hash tables designed for arbitrary data (strings, objects) representing the results of database queries or downstream service API calls. Common use cases for caches are storing user session data, dynamic web pages, and results of database queries. The cache appears to application services as a single store, and objects are allocated to individual cache servers using a hash function on the object key.
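
The allocation of keys to servers can be sketched as a hash-based lookup. Note this modulo scheme is for illustration only; production clients typically use consistent hashing so that adding or removing servers does not remap most keys:

import java.util.List;

public class CacheRouter {
  private final List<String> servers;

  public CacheRouter(List<String> servers) {
    this.servers = servers;
  }

  // Hash the object key to select the cache server that stores it
  public String serverFor(String key) {
    int index = Math.abs(key.hashCode() % servers.size());
    return servers.get(index);
  }
}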

基本方案如图6-1所示。该服务首先检查缓存以查看其所需的数据是否可用。如果是,则返回缓存内容作为结果——这称为缓存命中。如果数据不在缓存中(缓存未命中),服务将从数据库检索请求的数据并将查询结果写入缓存,以便后续客户端请求可用,而无需查询数据库。

The basic scheme is shown in Figure 6-1. The service first checks the cache to see if the data it requires is available. If so, it returns the cached contents as the results—this is known as a cache hit. If the data is not in the cache—a cache miss—the service retrieves the requested data from the database and writes the query results to the cache so it is available for subsequent client requests without querying the database.

应用级缓存
图 6-1。应用级缓存

例如,在繁忙的冬季度假胜地,滑雪者和单板滑雪者可以使用他们的移动应用程序来估计整个度假村的缆车等待时间。这使他们能够提前计划并避开拥挤的区域,因为在这些区域,他们可能需要等待 15 分钟或更长时间才能乘上缆车!

For example, at a busy winter resort, skiers and snowboarders can use their mobile app to get an estimate of the lift wait times across the resort. This enables them to plan and avoid congested areas where they will have to wait to ride a lift for 15 minutes or more!

每次滑雪者乘上缆车时,都会向该公司的服务发送一条消息,该服务收集有关滑雪者交通模式的数据。利用这些数据,系统可以根据乘坐缆车的滑雪者数量及其到达率来估计缆车等待时间。这是一项昂贵的计算,在繁忙时段可能需要一秒或更长时间,因为它需要汇总可能数以万计的缆车乘坐记录并执行等待时间计算。因此,一旦计算出结果,这些结果就被认为在五分钟内有效。只有在这段时间过后,才会执行新的计算并生成新结果。

Every time a skier loads a lift, a message is sent to the company’s service that collects data about skier traffic patterns. Using this data, the system can estimate lift wait times from the number of skiers who ride a lift and the rate they are arriving. This is an expensive calculation, taking maybe a second or more at busy times, as it requires aggregating potentially tens of thousands of lift ride records and performing the wait time calculation. For this reason, once the results are calculated, they are deemed valid for five minutes. Only after this time has elapsed is a new calculation performed and results produced.

以下代码示例显示了无状态的 LiftWaitService 如何工作。当请求到达时,服务首先检查缓存,查看是否有最新的等待时间。如果有,结果会立即返回给客户端。如果结果不在缓存中,服务将调用一个下游服务,由它执行缆车等待时间计算并以 List 形式返回。这些结果先被存入缓存,然后返回给客户端:

The following code example shows how a stateless LiftWaitService might work. When a request arrives, the service first checks the cache to see if the latest wait times are available. If they are, the results are immediately returned to the client. If the results are not in the cache, the service calls a downstream service which performs the lift wait calculations and returns them as a List. These results are then stored in the cache and then returned to the client:

public class LiftWaitService {
  public List getLiftWaits(String resort) {
    // Check the cache first -- the key identifies the resort
    List liftWaitTimes = cache.get("liftwaittimes:" + resort);
    if (liftWaitTimes == null) {
      // Cache miss: call the downstream service to calculate wait times
      liftWaitTimes = skiCo.getLiftWaitTimes(resort);
      // add result to cache, expire in 300 seconds
      cache.put("liftwaittimes:" + resort, liftWaitTimes, 300);
    }
    return liftWaitTimes;
  }
}

缓存访问需要一个与结果相关联的键。在此示例中,键由字符串 "liftwaittimes:" 与客户端传递给服务的度假村标识符连接而成。然后,缓存对该键进行哈希处理,以确定缓存值所在的服务器。

Cache access requires a key with which to associate the results. In this example, the key is constructed with the string liftwaittimes:” concatenated with the resort identifier that is passed by the client to the service. This key is then hashed by the cache to identify the server where the cached value resides.

当向缓存写入新值时,300 秒这个值作为参数传递给 put 操作。这称为生存时间 (TTL) 值。它告诉缓存:300 秒后,应将该键值对从缓存中逐出,因为该值已不再是最新的(也称为陈旧值)。

When a new value is written to the cache, a value of 300 seconds is passed as a parameter to the put operation. This is known as a time to live (TTL) value. It tells the cache that after 300 seconds this key-value pair should be evicted from the cache as the value is no longer current (also known as stale).

当缓存值有效时,所有请求都将使用它。这意味着无需为每次调用执行昂贵的缆车等待时间计算。快速网络上的缓存命中可能需要一毫秒,比缆车等待时间的计算快得多。当缓存值在 300 秒后被逐出时,下一个请求将导致缓存未命中,进而触发计算新值并将其存入缓存。因此,如果我们在 5 分钟内收到 N 个请求,则有 N-1 个请求由缓存处理。想象一下,如果 N 是 10,000 呢?这样可以节省大量昂贵的计算,数据库也可以把 CPU 周期用于处理其他查询。

While the cache value is valid, all requests will utilize it. This means there is no need to perform the expensive lift wait time calculation for every call. A cache hit on a fast network will take maybe a millisecond—much faster than the lift wait times calculation. When the cache value is evicted after 300 seconds, the next request will result in a cache miss. This will result in the calculation of the new values to be stored in the cache. Therefore, if we get N requests in a 5-minute period, N-1 requests are served from the cache. Imagine if N is 10,000? This is a lot of expensive calculations saved, and CPU cycles that your database can use to process other queries.

使用 TTL 等过期时间是使缓存内容无效的常见方法。它确保服务不会向客户端提供陈旧、过时的结果。它还使系统能够对缓存内容进行一些控制,而缓存内容通常是有限的。如果缓存的项目没有定期刷新,缓存就会被填满。在这种情况下,缓存将采用诸如最近最少使用最少访问之类的策略来选择缓存条目来逐出并为更新的、及时的结果创建空间。

Using an expiry time like the TTL is a common way to invalidate cache contents. It ensures a service doesn’t deliver stale, out-of-date results to a client. It also enables the system to have some control over cache contents, which are typically limited. If cached items are not flushed periodically, the cache will fill up. In this case, a cache will adopt a policy such as least recently used or least accessed to choose cache entries to evict and create space for more current, timely results.
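
A least recently used policy is easy to sketch in Java with LinkedHashMap's access ordering (a toy single-node illustration; distributed caches use concurrent, memory-bounded implementations):

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int capacity;

  public LruCache(int capacity) {
    super(16, 0.75f, true); // accessOrder=true orders entries by last access
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Evict the least recently used entry once capacity is exceeded
    return size() > capacity;
  }
}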

应用程序缓存可以显著提高吞吐量、减少延迟并提高客户端应用程序的响应能力。实现这些理想品质的关键是让尽可能多的请求由缓存来满足。总体设计原则是最大化缓存命中率并最小化缓存未命中率。当发生缓存未命中时,必须通过查询数据库或下游服务来满足请求。然后,可以将请求的结果写入缓存,供后续访问使用。

Application caching can provide significant throughput boosts, reduced latencies, and increased client application responsiveness. The key to achieving these desirable qualities is to satisfy as many requests as possible from the cache. The general design principle is to maximize the cache hit rate and minimize the cache miss rate. When a cache miss occurs, the request must be satisfied through querying databases or downstream services. The results of the request can then be written to the cache and hence be available for further accesses.

对于缓存命中率应该是多少,没有硬性规定,因为它取决于构建缓存内容的成本和缓存项的更新率。理想的缓存设计的读取次数远多于更新次数。这是因为当必须更新某个项目时,应用程序需要使因更新而过时的缓存条目无效。这意味着下一个请求将导致缓存未命中。1

There’s no hard-and-fast rule on what the cache hit rate should be, as it depends on the cost of constructing the cache contents and the update rate of cached items. Ideal cache designs have many more reads than updates. This is because when an item must be updated, the application needs to invalidate cache entries that are now stale because of the update. This means the next request will result in a cache miss.1

当项目被定期更新时,缓存未命中的成本可能会抵消缓存的优势。因此,服务设计者需要仔细考虑应用程序所经历的查询和更新模式,并构建能产生最大效益的缓存机制。一旦服务投入生产,监控缓存使用情况也至关重要,以确保命中率和未命中率符合设计预期。缓存会提供管理实用程序和 API,用于监控缓存的使用特征。例如,memcached 提供大量可用的统计信息,包括命中和未命中计数,如下面的输出片段所示:

When items are updated regularly, the cost of cache misses can negate the benefits of the cache. Service designers therefore need to carefully consider query and update patterns an application experiences, and construct caching mechanisms that yield the most benefit. It is also crucial to monitor the cache usage once a service is in production to ensure the hit and miss rates are in line with design expectations. Caches will provide both management utilities and APIs to enable monitoring of the cache usage characteristics. For example, memcached makes a large number of statistics available, including the hit and miss counts as shown in the snippet of output below:

STAT get_hits 98567
STAT get_misses 11001
STAT evictions 0

应用程序级缓存也称为缓存旁路(cache-aside)模式。这个名字指的是:如果所需的结果在缓存中可用,应用程序代码实际上会绕过数据存储系统。这与应用程序始终读写缓存的其他缓存模式形成鲜明对比。后者被称为通读(read-through)、直写(write-through)和后写(write-behind)缓存,定义如下:

Application-level caching is also known as the cache-aside pattern. The name references the fact that the application code effectively bypasses the data storage systems if the required results are available in the cache. This contrasts with other caching patterns in which the application always reads from and writes to the cache. These are known as read-through, write-through, and write-behind caches, defined as follows:

通读
Read-through
应用程序通过访问缓存来满足所有请求。如果所需的数据在缓存中不可用,则调用加载器访问数据系统,并将结果加载到缓存中供应用程序使用。
The application satisfies all requests by accessing the cache. If the data required is not available in the cache, a loader is invoked to access the data systems and load the results in the cache for the application to utilize.
直写式
Write-through
应用程序总是将更新写入缓存。当缓存更新时,会调用写入器将新的缓存值写入数据库。数据库更新完成后,应用程序才算完成该请求。
The application always writes updates to the cache. When the cache is updated, a writer is invoked to write the new cache values to the database. When the database is updated, the application can complete the request.
后写式
Write-behind
与直写式类似,但应用程序不会等待值从缓存写入数据库。这提高了请求响应能力,但代价是:如果缓存服务器在数据库更新完成之前崩溃,可能会丢失更新。这也称为回写式(write-back)缓存,也是大多数数据库引擎内部使用的策略。
Like write-through, except the application does not wait for the value to be written to the database from the cache. This increases request responsiveness at the expense of possible lost updates if the cache server crashes before a database update is completed. This is also known as a write-back cache, and internally is the strategy used by most database engines.

这些缓存方法的优点在于它们简化了应用程序逻辑。应用程序始终利用缓存进行读写,而缓存提供了“魔力”,确保缓存与后端存储系统正确交互。这与缓存旁路模式形成对比:在缓存旁路模式中,应用程序逻辑必须感知缓存未命中。

The beauty of these caching approaches is that they simplify application logic. Applications always utilize the cache for reads and writes, and the cache provides the “magic” to ensure the cache interacts appropriately with the backend storage systems. This contrasts with the cache-aside pattern, in which application logic must be cognizant of cache misses.
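
For instance, a read-through cache can be sketched as a wrapper that hides miss handling behind the cache interface (the loader function standing in for the database access is an assumption of this sketch):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ReadThroughCache<K, V> {
  private final Map<K, V> store = new ConcurrentHashMap<>();
  private final Function<K, V> loader; // fetches values from the database

  public ReadThroughCache(Function<K, V> loader) {
    this.loader = loader;
  }

  // The application only ever calls get(); on a miss the cache itself
  // invokes the loader and retains the result, so no cache-miss logic
  // appears in application code
  public V get(K key) {
    return store.computeIfAbsent(key, loader);
  }
}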

通读、直写和后写策略需要这样一种缓存技术:它可以通过应用程序特定的处理程序进行增强,在应用程序访问缓存时执行数据库读写操作。例如,NCache 支持由应用程序实现的提供者接口。通读式缓存发生缓存未命中时,以及直写式缓存发生写入时,这些接口会被自动调用。其他此类缓存本质上是专用的数据库缓存,因此要求缓存访问与底层数据库模型相同。Amazon 的 DynamoDB Accelerator (DAX) 就是一个例子。DAX 位于应用程序代码和 DynamoDB 之间,透明地充当高速内存缓存,以减少数据库访问时间。

Read-through, write-through, and write-behind strategies require a cache technology that can be augmented with an application-specific handler to perform database reads and writes when the application accesses the cache. For example, NCache supports provider interfaces that the application implements. These are invoked automatically on cache misses for read-through caches and on writes for write-through caches. Other such caches are essentially dedicated database caches, and hence require cache access to be identical to the underlying database model. An example of this is Amazon’s DynamoDB Accelerator (DAX). DAX sits between the application code and DynamoDB, and transparently acts as a high-speed, in-memory cache to reduce database access times.

缓存旁路策略的一个显著优点是它对缓存故障具有弹性。在缓存不可用的情况下,所有请求本质上都被当作缓存未命中来处理。性能会受到影响,但服务仍然能够满足请求。此外,由于 Redis 和 memcached 等缓存平台采用简单的分布式哈希表模型,对其进行扩展也很简单。由于这些原因,缓存旁路模式是大规模可扩展系统中的主要方法。

One significant advantage of the cache-aside strategy is that it is resilient to cache failure. In circumstances when the cache is unavailable, all requests are essentially handled as a cache miss. Performance will suffer, but services will still be able to satisfy requests. In addition, scaling cache-aside platforms such as Redis and memcached is straightforward due to their simple, distributed hash table model. For these reasons, the cache-aside pattern is the primary approach seen in massively scalable systems.

Web 缓存

Web Caching

网站响应速度如此之快的原因之一是互联网上遍布着 Web 缓存。Web 缓存在规定的时间段内存储给定资源(例如网页或图像)的副本。缓存会拦截客户端请求,如果本地缓存了所请求的资源,就返回副本,而不是将请求转发到目标服务。因此,许多请求无需给服务带来负担即可得到满足。此外,由于缓存在物理上更接近客户端,请求的延迟也会更低。

One of the reasons that websites are so highly responsive is that the internet is littered with web caches. Web caches store a copy of a given resource—for example, a web page or an image, for a defined time period. The caches intercept client requests and if they have a requested resource cached locally, they return the copy rather than forwarding the request to the target service. Hence, many requests can be satisfied without placing a burden on the service. Also, as the caches are physically closer to the client, the requests will have lower latencies.

图 6-2给出了 Web 缓存架构的概述。存在多个级别的缓存,从客户端的 Web 浏览器缓存和基于本地组织的缓存开始。ISP 还将实现通用 Web 代理缓存,并且可以在应用程序服务执行域内部署反向代理缓存。Web 浏览器缓存也称为私有缓存(针对单个用户)。组织和 ISP 代理缓存是支持多个用户请求的共享缓存。

Figure 6-2 gives an overview of the web caching architecture. Multiple levels of caches exist, starting with the client’s web browser cache and local organization-based caches. ISPs will also implement general web proxy caches, and reverse proxy caches can be deployed within the application services execution domain. Web browser caches are also known as private caches (for a single user). Organizational and ISP proxy caches are shared caches that support requests from multiple users.

互联网中的网络缓存
图 6-2。互联网中的网络缓存

边缘缓存,也称为内容交付网络 (CDN),位于全球各个战略地理位置,因此他们将经常访问的数据缓存在靠近客户端的地方。例如,视频流提供商可以在澳大利亚悉尼配置边缘缓存,为澳大利亚用户提供视频内容,而不是从位于美国的源服务器跨越太平洋流式传输内容。边缘缓存由 CDN 提供商在全球部署。Akamai 是最初的 CDN 提供商,拥有 2,000 多个地点,提供全球高达 30% 的互联网流量。对于拥有全球用户的富媒体网站来说,边缘缓存至关重要。

Edge caches, also known as content delivery networks (CDNs), live at various strategic geographical locations globally, so that they cache frequently accessed data close to clients. For example, a video streaming provider may configure an edge cache in Sydney, Australia to serve video content to Australasian users rather than streaming content across the Pacific Ocean from US-based origin servers. Edge caches are deployed globally by CDN providers. Akamai, the original CDN provider, has over 2,000 locations and delivers up to 30% of internet traffic globally. For media-rich sites with global users, edge caches are essential.

缓存通常只存储 GET 请求的结果,缓存键是关联的 GET 的 URI。当客户端发送 GET 请求时,该请求可能会被请求路径上的一个或多个缓存拦截。任何拥有所请求资源最新副本的缓存都可以响应该请求。如果没有找到缓存内容,则该请求将由服务端点(在 Web 技术术语中也称为源服务器)提供服务。

Caches typically store the results of GET requests only, and the cache key is the URI of the associated GET. When a client sends a GET request, it may be intercepted by one or more caches along the request path. Any cache with a fresh copy of the requested resource may respond to the request. If no cached content is found, the request is served by the service endpoint, which is also called the origin server in web technology parlance.

服务可以使用 HTTP 缓存指令来控制哪些结果被缓存以及缓存多长时间。服务在各种 HTTP 响应标头中设置这些指令,如以下简单示例所示:

Services can control what results are cached and for how long they are stored by using HTTP caching directives. Services set these directives in various HTTP response headers, as shown in this simple example:

Response:
HTTP/1.1 200 OK
Content-Length: 9842
Content-Type: application/json 
Cache-Control: public 
Date: Fri, 26 Mar 2019 09:33:49 GMT 
Expires: Fri, 26 Mar 2019 09:38:49 GMT

我将在以下小节中描述这些指令。

I will describe these directives in the following subsections.

缓存控制

Cache-Control

HTTP 的 Cache-Control 标头可由客户端请求和服务响应使用,以指定应如何对感兴趣的资源使用缓存。可能的值为:

The Cache-Control HTTP header can be used by client requests and service responses to specify how the caching should be utilized for the resources of interest. Possible values are:

no-store
no-store
指定请求响应中的资源不应被缓存。这通常用于敏感数据,此类数据每次请求都需要从源服务器检索。
Specifies that a resource from a request response should not be cached. This is typically used for sensitive data that needs to be retrieved from the origin servers each request.
no-cache
no-cache
指定缓存资源在使用前必须通过源服务器重新验证。我在“Etag”部分讨论重新验证。
Specifies that a cached resource must be revalidated with an origin server before use. I discuss revalidation in the section “Etag”.
private
private
指定资源只能由特定于用户的设备(例如 Web 浏览器)缓存。
Specifies a resource can be cached only by a user-specific device such as a web browser.
public
public
指定资源可以由任何代理服务器缓存。
Specifies a resource can be cached by any proxy server.
max-age
max-age
定义资源的缓存副本应保留的时间长度(以秒为单位)。过期后,缓存必须通过向源服务器发送请求来刷新资源。
Defines the length of time in seconds a cached copy of a resource should be retained. After expiration, a cache must refresh the resource by sending a request to the origin server.
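
From a Java service, these directives are just response headers. A hedged servlet-style sketch, using values similar to the earlier example:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ReportServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    // Allow any shared proxy cache to store this response for one hour
    resp.setHeader("Cache-Control", "public, max-age=3600");
    resp.setContentType("application/json");
    resp.getWriter().write("{\"report\": \"...\"}");
  }
}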

过期和上次修改时间

Expires and Last-Modified

HTTP 的 Expires 和 Last-Modified 标头与 max-age 指令交互,控制缓存数据的保留时间。

The Expires and Last-Modified HTTP headers interact with the max-age directive to control how long cached data is retained.

缓存的存储资源有限,因此必须定期从内存中逐出项目以腾出空间。为了影响缓存驱逐,服务可以指定缓存中的资源应保持有效(即新鲜)的时间。当对新鲜资源的请求到达时,缓存将提供本地存储的结果,而无需联系源服务器。一旦缓存资源的指定保留期到期,它就会变得陈旧,成为驱逐的候选对象。

Caches have limited storage resources and hence must periodically evict items from memory to create space. To influence cache eviction, services can specify how long resources in the cache should remain valid, or fresh. When a request arrives for a fresh resource, the cache serves the locally stored results without contacting the origin server. Once any specified retention period for a cached resource expires, it becomes stale and becomes a candidate for eviction.

新鲜度是使用多个标头值的组合来计算的。"Cache-Control: max-age=N" 标头是主要指令,该值以秒为单位指定新鲜期。

Freshness is calculated using a combination of header values. The "Cache-Control: max-age=N" header is the primary directive, and this value specifies the freshness period in seconds.

如果未指定 max-age,则接下来检查 Expires 标头。如果该标头存在,则用它来计算新鲜期。Expires 标头指定一个明确的日期和时间,超过该时间后,资源应被视为陈旧。例如:

If max-age is not specified, the Expires header is checked next. If this header exists, then it is used to calculate the freshness period. The Expires header specifies an explicit date and time after which the resource should be considered stale. For example:

到期时间:2022 年 10 月 26 日星期三 09:39:00 GMT
Expires: Wed, 26 Oct 2022 09:39:00 GMT

作为最后的手段,可以使用 Last-Modified 标头来计算资源保留期。该标头由源服务器设置,指定资源上次更新的时间,格式与 Expires 标头相同。缓存服务器可以根据其支持的启发式计算,利用 Last-Modified 来确定资源的新鲜度生命周期。该计算使用 Date 标头,它指定响应消息从源服务器发出的时间。资源保留期等于 Date 标头与 Last-Modified 标头的值之差除以 10。

As a last resort, the Last-Modified header can be used to calculate resource retention periods. This header is set by the origin server to specify when a resource was last updated, and uses the same format as the Expires header. A cache server can use Last-Modified to determine the freshness lifetime of a resource based on a heuristic calculation that the cache supports. The calculation uses the Date header, which specifies the time a response message was sent from an origin server. A resource retention period subsequently becomes the difference between the Date and Last-Modified header values, divided by 10.
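
Expressed in code, the heuristic is simply the following (a sketch using java.time; the divisor of 10 is the heuristic described above):

import java.time.Duration;
import java.time.Instant;

public class Freshness {
  // Heuristic freshness lifetime: (Date - Last-Modified) / 10
  public static Duration heuristicLifetime(Instant date, Instant lastModified) {
    return Duration.between(lastModified, date).dividedBy(10);
  }
}

For example, a resource last modified 10 hours before the response was sent would be considered fresh for 1 hour.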

Etag

Etag

HTTP 提供了另一个可用于控制缓存项新鲜度的指令,称为 Etag。Etag 是一个不透明值,Web 缓存可以用它来检查缓存的资源是否仍然有效。下面我将用一个示例来解释这一点。

HTTP provides another directive that can be used to control cache item freshness. This is known as an Etag. An Etag is an opaque value that can be used by a web cache to check if a cached resource is still valid. I’ll explain this using an example in the following.

回到我们的冬季度假村示例:在冬季,度假村每天早上 6 点都会发布天气报告。如果白天天气发生变化,度假村会更新报告。有时这种情况每天会发生两三次,有时天气稳定则完全不会发生。当天气报告请求到达时,服务会在响应中给出定义缓存新鲜度的最长期限,以及一个代表最近发布的天气报告版本的 Etag。下面的 HTTP 示例展示了这一点,它告诉缓存在至少 3,600 秒(即 60 分钟)内将天气报告资源视为新鲜。该 Etag 值(即 "blackstone-weather-03/26/19-v1")是使用服务为此特定资源定义的标签简单生成的。在此示例中,Etag 代表 Blackstone 度假村 2019 年 3 月 26 日报告的第一个版本。其他常见策略是使用 MD5 等哈希算法生成 Etag:

Going back to our winter resort example, the resort produces a weather report at 6 a.m. every day during the winter season. If the weather changes during the day, the resort updates the report. Sometimes this happens two or three times each day, and sometimes not at all if the weather is stable. When a request arrives for the weather report, the service responds with a maximum age to define cache freshness, and also an Etag that represents the version of the weather report that was last issued. This is shown in the following HTTP example, which tells a cache to treat the weather report resource as fresh for at least 3,600 seconds, or 60 minutes. The Etag value, namely "blackstone-weather-03/26/19-v1", is simply generated using a label that the service defines for this particular resource. In this example, the Etag represents the first version of the report for the Blackstone Resort on March 26th, 2019. Other common strategies are to generate the Etag using a hash algorithm such as MD5:

Request:
GET /skico.com/weather/Blackstone

Response:
HTTP/1.1 200 OK
Content-Length: ...
Content-Type: application/json
Date: Fri, 26 Mar 2019 09:33:49 GMT
Cache-Control: public, max-age=3600
ETag: "blackstone-weather-03/26/19-v1"
<!-- Content omitted -->
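
A minimal sketch of the MD5 strategy mentioned above, hashing the resource representation to derive an opaque Etag value:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class EtagGenerator {
  // Derive an opaque Etag by hashing the response body
  public static String etagFor(String responseBody) throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(responseBody.getBytes(StandardCharsets.UTF_8));
    // Hex-encode the 128-bit digest and quote it, as Etag values are quoted
    return "\"" + new BigInteger(1, digest).toString(16) + "\"";
  }
}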

在接下来的一个小时内,Web 缓存只是向所有发出 GET 请求的客户端提供这份缓存的天气报告。这意味着源服务器无需处理这些请求——这正是我们希望从有效缓存中获得的结果。但一小时后,资源就会变得陈旧。现在,当对过时资源的请求到达时,缓存会将其转发到源服务器,并带上 If-None-Match 指令和 Etag,以询问该资源(在我们的例子中是天气报告)是否仍然有效。这称为重新验证。

For the next hour, the web cache simply serves this cached weather report to all clients who issue a GET request. This means the origin servers are freed from processing these requests—the outcome that we want from effective caching. After an hour though, the resource becomes stale. Now, when a request arrives for a stale resource, the cache forwards it to the origin server with a If-None-Match directive along with the Etag to inquire if the resource, in our case the weather report, is still valid. This is known as revalidation.

对此请求有两种可能的响应:

There are two possible responses to this request:

  • 如果请求中的 Etag 与服务中该资源关联的值匹配,则缓存的值仍然有效。因此,源服务器可以返回 304 (Not Modified) 响应,如以下示例所示。由于缓存的值仍然是最新的,因此不需要响应正文,从而节省带宽,对于大型资源尤其如此。响应还可能包含新的缓存指令,以更新缓存资源的新鲜度。

  • If the Etag in the request matches the value associated with the resource in the service, the cached value is still valid. The origin server can therefore return a 304 (Not Modified) response, as shown in the following example. No response body is needed as the cached value is still current, thus saving bandwidth, especially for large resources. The response may also include new cache directives to update the freshness of the cached resource.

  • 源服务器也可能忽略重新验证请求,并以 200 OK 响应代码、响应正文以及代表最新版本天气报告的 Etag 进行响应:

  • The origin server may ignore the revalidation request and respond with a 200 OK response code, a response body and Etag representing the latest version of the weather report:

Request:
GET /skico.com/weather/Blackstone
If-None-Match: "blackstone-weather-03/26/19-v1"
Response:
HTTP/1.1 304 Not Modified
Cache-Control: public, max-age=3600

在服务实现中,需要有一种机制来支持重新验证。在我们的天气预报示例中,一种策略如下:

In the service implementation, a mechanism is needed to support revalidation. In our weather report example, one strategy is as follows:

生成新的每日报告
Generate a new daily report
天气报告被构建并存储在数据库中,Etag 作为其一个属性。
The weather report is constructed and stored in a database, with the Etag as an attribute.
GET 请求
GET requests
当任何 GET 请求到达时,服务都会返回天气报告和 Etag。这还会沿着网络响应路径填充 Web 缓存。
When any GET request arrives, the service returns the weather report and the Etag. This will also populate web caches along the network response path.
条件 GET 请求
Conditional GET requests
对于带有 If-None-Match: 指令和 Etag 的条件请求,在数据库中查找 Etag 值,如果该值没有改变则返回 304。如果存储的 Etag 已更改,则返回 200,以及最新的天气报告和新的 Etag 值。
For conditional requests with the If-None-Match: directive, look up the Etag value in the database and return 304 if the value has not changed. If the stored Etag has changed, return 200 along with the latest weather report and a new Etag value.
更新天气预报
Update the weather report
新版本的天气报告存储在数据库中,并修改 Etag 值以代表响应的这个新版本。
A new version of the weather report is stored in the database and the Etag value is modified to represent this new version of the response.
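
A hedged JAX-RS sketch of the conditional GET step above (WeatherStore is a hypothetical stand-in for the database lookups described in this strategy):

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.EntityTag;
import javax.ws.rs.core.Request;
import javax.ws.rs.core.Response;

@Path("/weather")
public class WeatherResource {

  @GET
  @Path("/{resort}")
  public Response getReport(@PathParam("resort") String resort,
                            @Context Request request) {
    // Hypothetical store holding the current report and its Etag
    EntityTag etag = new EntityTag(WeatherStore.currentEtag(resort));

    // Returns a non-null builder if the client's If-None-Match matches
    Response.ResponseBuilder builder = request.evaluatePreconditions(etag);
    if (builder != null) {
      return builder.build(); // 304 Not Modified, no body sent
    }
    // Etag has changed: return the latest report with the new Etag
    return Response.ok(WeatherStore.report(resort)).tag(etag).build();
  }

  // Hypothetical stub standing in for the database access
  static class WeatherStore {
    static String currentEtag(String resort) { return "blackstone-weather-03/26/19-v1"; }
    static String report(String resort) { return "{ \"forecast\": \"...\" }"; }
  }
}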

如果使用得当,Web 缓存可以显著降低延迟并节省网络带宽。对于图像和文档等大型项目尤其如此。此外,由于请求由 Web 缓存而不是应用程序服务来处理,源服务器上的请求负载得以减少,从而创造出额外的容量。

When used effectively, web caching can significantly reduce latencies and save network bandwidth. This is especially true for large items such as images and documents. Further, as web caches handle requests rather than application services, this reduces the request load on origin servers, creating additional capacity.

Squid 和 Varnish 等代理缓存广泛部署在互联网上。Web 缓存在用于静态数据(图像、视频和音频流)以及不经常更改的数据(例如天气报告)时最为有效。因此,HTTP 缓存与代理缓存和边缘缓存相结合所提供的强大功能,是构建可扩展应用程序的宝贵工具。

Proxy caches such as Squid and Varnish are extensively deployed on the internet. Web caching is most effective when deployed for static data (images, videos, and audio streams) as well as infrequently changing data such as weather reports. The powerful facilities provided by HTTP caching in conjunction with proxy and edge caches are therefore invaluable tools for building scalable applications.

总结和延伸阅读

Summary and Further Reading

缓存是任何可扩展分布式系统的重要组成部分。缓存将许多客户端都会请求的信息存储在内存中,并将该信息作为结果提供给客户端请求。只要信息仍然有效,它就可以被提供数百万次,而无需承担重新创建的成本。

Caching is an essential component of any scalable distributed system. Caching stores information that is requested by many clients in memory and serves this information as the results to client requests. While the information is still valid, it can be served potentially millions of times without the cost of re-creation.

使用分布式缓存的应用程序缓存是可扩展系统中最常见的缓存方法。此方法要求应用程序逻辑在客户端请求到达时检查缓存的值,并在可用时返回这些值。如果缓存命中率很高,大多数请求都能由缓存结果满足,那么后端服务和数据库的负载就可以大大减少。

Application caching using a distributed cache is the most common approach to caching in scalable systems. This approach requires the application logic to check for cached values when a client request arrives and return these if available. If the cache hit rate is high, with most requests being satisfied with cached results, the load on backend services and databases can be considerably reduced.

互联网还具有内置的多级缓存基础设施。应用程序可以通过使用作为 HTTP 标头一部分的缓存指令来利用这一点。这些指令使服务能够指定可以缓存哪些信息、应该缓存多长时间,并采用协议来检查过时的缓存条目是否仍然有效。如果使用得当,HTTP 缓存可以显著减少下游服务和数据库的请求负载。

The internet also has a built-in, multilevel caching infrastructure. Applications can exploit this through the use of cache directives that are part of HTTP headers. These directives enable a service to specify what information can be cached, for how long it should be cached, and employ a protocol for checking to see if a stale cache entry is still valid. Used wisely, HTTP caching can significantly reduce request loads on downstream services and databases.

缓存是软件和系统中一个成熟的领域,相关文献往往分散在许多通用和特定于产品的来源中。Gerardus Blokdyk 的《Memcached》第 3 版(5StarCooks,2021 年)是“万物缓存”的绝佳来源。虽然书名表明内容以产品为中心,但其中包含的知识可以轻松地迁移到使用其他竞争技术的缓存设计中。

Caching is a well established area of software and systems, and the literature tends to be scattered across many generic and product-specific sources. A great source of “all things caching” is Gerardus Blokdyk’s Memcached, 3rd ed. (5StarCooks, 2021). While the title gives away the product-focused content, the knowledge contained can be translated easily to cache designs with other competing technologies.

关于 HTTP/2 的重要信息来源是Stephen Ludin 和 Javier Garza 撰写的《学习 HTTP/2:初学者实用指南》 (O'Reilly,2017 年)。尽管已经过时, Duane Wessels 的《Web Caching》(O'Reilly,2001)包含了足够的通用智慧,仍然是非常有用的参考。

A great source of information on HTTP/2 in general is Learning HTTP/2: A Practical Guide for Beginners by Stephen Ludin and Javier Garza (O’Reilly, 2017). And while dated, Web Caching by Duane Wessels (O’Reilly, 2001) contains enough generic wisdom to remain a very useful reference.

CDN 本身就是一个复杂的、特定于供应商的主题。它们适合具有丰富媒体的网站,这些网站的用户群体分布在不同的地区,需要快速的内容交付。要获得易读的 CDN 概述,Ogi Djuraskovic 的网站值得一看。

CDNs are a complex, vendor-specific topic in themselves. They come into their own for media-rich websites with a geographically dispersed group of users that require fast content delivery. For a highly readable overview of CDNs, Ogi Djuraskovic’s site is worth checking out.

1某些应用程序用例可能会在进行更新的同时创建新的缓存条目。如果某些键很“热”并且很可能在下次更新之前再次访问,这可能会很有用。这称为“急切”缓存更新。

1 Some application use cases may make it possible for a new cache entry to be created at the same time an update is made. This can be useful if some keys are “hot” and will have a great likelihood of being accessed again before the next update. This is known as an “eager” cache update.

第 7 章 异步消息传递

Chapter 7. Asynchronous Messaging

对于一本分布式系统书籍,我不可避免地在前面的章节中花了相当多的时间讨论通信问题。通信是分布式系统的基础,也是架构师需要将其纳入系统设计的一个主要问题。

Inevitably for a distributed systems book, I’ve spent a fair bit of time in the preceding chapters discussing communications issues. Communication is fundamental to distributed systems, and it is a major issue that architects need to incorporate into their system designs.

到目前为止,这些讨论已假定采用同步消息传递风格。客户端发送请求并等待服务器响应。大多数分布式通信都是这样设计的,因为客户端需要即时响应才能继续。

So far, these discussions have assumed a synchronous messaging style. A client sends a request and waits for a server to respond. This is how most distributed communications are designed to occur, as the client requires an instantaneous response to proceed.

并非所有系统都有此要求。例如,当我退回一些网上购买的商品时,我会将它们带到当地的 UPS 或 FedEx 商店。他们扫描我的二维码,然后我把包裹交给他们处理。然后,我不会在商店等待确认供应商已成功收到产品并且我的付款已退回。那将是乏味且低效的。我相信运输服务会将我不需要的货物运送给供应商,并希望在几天后收到处理完毕的消息。

Not all systems have this requirement. For example, when I return some goods I’ve purchased online, I take them to my local UPS or FedEx store. They scan my QR code, and I give them the package to process. I do not then wait in the store for confirmation that the product has been successfully received by the vendor and my payment returned. That would be dull and unproductive. I trust the shipping service to deliver my unwanted goods to the vendor and expect to get a message a few days later when it has been processed.

我们可以设计分布式系统来模拟这种行为。使用异步通信方式,客户端(称为生产者)将其请求发送到中间消息传递服务。这充当传递机制,将请求转发到预期目的地(称为消费者)进行处理。生产者“发出后就忘记”他们发送的请求。一旦请求被传递到消息传递服务,生产者就会继续其逻辑的下一步,并确信它发送的请求最终会得到处理。这提高了系统响应能力,因为生产者不必等待请求处理完成。

We can design our distributed systems to emulate this behavior. Using an asynchronous communications style, clients, known as producers, send their requests to an intermediary messaging service. This acts as a delivery mechanism to relay the request to the intended destination, known as the consumer, for processing. Producers “fire and forget” the requests they send. Once a request is delivered to the messaging service, the producer moves on to the next step in their logic, confident that the requests it sends will eventually get processed. This improves system responsiveness, in that producers do not have to wait until the request processing is completed.

在本章中,我将描述异步消息传递系统支持的基本通信机制。我还将讨论吞吐量和数据安全之间固有的权衡 - 基本上,确保您的系统不会丢失消息。我还将介绍通常部署在高度可扩展的分布式系统中的三种关键消息传递模式。

In this chapter I’ll describe the basic communication mechanisms that an asynchronous messaging system supports. I’ll also discuss the inherent trade-offs between throughput and data safety—basically, making sure your systems don’t lose messages. I’ll also cover three key messaging patterns that are commonly deployed in highly scalable distributed systems.

为了使这些概念具体化,我将描述 RabbitMQ,这是一个广泛部署的开源消息系统。在介绍了该技术的基础知识之后,我将重点介绍设计高吞吐量消息传递系统时需要了解的核心功能集。

To make these concepts concrete, I’ll describe RabbitMQ, a widely deployed open source messaging system. After introducing the basics of the technology, I’ll focus on the core set of features you need to be aware of in order to design a high-throughput messaging system.

消息传递简介

Introduction to Messaging

异步消息平台是一个成熟的技术领域,该领域拥有多种产品。1 久负盛名的 IBM MQ 系列出现于 1993 年,至今仍是企业系统的中流砥柱。Java 消息服务 (JMS) 是一种 API 级规范,有多个 JEE 供应商的实现支持。RabbitMQ(我将在本章后面用作示例)可以说是部署最广泛的开源消息系统。在消息传递的世界中,您永远不会缺少选择。

Asynchronous messaging platforms are a mature area of technology, with multiple products in the space.1 The venerable IBM MQ Series appeared in 1993 and is still a mainstay of enterprise systems. The Java Messaging Service (JMS), an API-level specification, is supported by multiple JEE vendor implementations. RabbitMQ, which I’ll use as an illustration later in this chapter, is arguably the most widely deployed open source messaging system. In the messaging world, you will never be short of choice.

虽然所有这些竞争产品的具体功能和 API 各不相同,但基本概念几乎相同。我将在下面的小节中介绍这些概念,然后在下一节中描述它们是如何在 RabbitMQ 中实现的。一旦您了解了一个消息传递平台的工作原理,就能比较容易地理解各竞争产品之间固有的相似点和差异。

While the specific features and APIs vary across all these competing products, the foundational concepts are pretty much identical. I’ll cover these in the following subsections, and then describe how they are implemented in RabbitMQ in the next section. Once you appreciate how one messaging platform works, it is relatively straightforward to understand the similarities and differences inherent in the competition.

消息传递原语

Messaging Primitives

从概念上讲,消息系统包括以下内容:

Conceptually, a messaging system comprises the following:

消息队列
Message queues
存储一系列消息的队列
Queues that store a sequence of messages
生产者
Producers
发送消息至队列
Send messages to queues
消费者
Consumers
从队列中检索消息
Retrieve messages from queues
消息代理
Message broker
管理一个或多个队列
Manages one or more queues

该方案如图 7-1所示。

This scheme is illustrated in Figure 7-1.

一个简单的消息系统
图 7-1。一个简单的消息系统

消息代理是一种管理一个或多个队列的服务。当生产者将消息发送到队列时,代理按照消息到达的顺序将消息添加到队列中——基本上是一种 FIFO 方法。代理负责有效地管理消息的接收和保留,直到一个或多个消费者检索消息,然后将消息从队列中删除。管理着许多队列和许多请求的消息代理,可以有效利用多个 vCPU 和内存来提供低延迟访问。

A message broker is a service that manages one or more queues. When messages are sent from producers to a queue, the broker adds messages to the queue in the order they arrive—basically a FIFO approach. The broker is responsible for efficiently managing message receipt and retention until one or more consumers retrieve the messages, which are then removed from the queue. Message brokers that manage many queues and many requests can effectively utilize many vCPUs and memory to provide low latency accesses.

Producers send messages to a named queue on a broker. Many producers can send messages to the same queue. A producer will wait until an acknowledgment message is received from the broker before the send operation is considered complete.

Many consumers can take messages from the same queue. Each message is retrieved by exactly one consumer. There are two modes of behavior for consumers to retrieve messages, known as pull or push. While the exact mechanisms are product-specific, the basic semantics are common across technologies:

  • In pull mode, also known as polling, consumers send a request to the broker, which responds with the next message available for processing. If there are no messages available, the consumer must poll the queue until messages arrive.

  • In push mode, a consumer informs the broker that it wishes to receive messages from a queue. The consumer provides a callback function that should be invoked when a message is available. The consumer then blocks (or does other work) and the message broker delivers messages to the callback function for processing when they are available.

Generally, the push mode, when available, is more efficient and is the recommended approach. It avoids the broker being potentially swamped by requests from multiple consumers and makes it possible to implement message delivery more efficiently in the broker.

Consumers will also acknowledge message receipt. Upon consumer acknowledgment, the broker is free to mark a message as delivered and remove it from the queue. Acknowledgment may be done automatically or manually.

If automatic acknowledgment is used, messages are acknowledged as soon as they are delivered to the consumer, and before they are processed. This provides the lowest latency message delivery as the acknowledgment can be sent back to the broker before the message is processed.

Often a consumer will want to ensure a message is fully processed before acknowledgment. In this case, it will utilize manual acknowledgments. This guards against the possibility of a message being delivered to a consumer but not being processed due to a consumer crash. It does, of course, increase message acknowledgment latency. Regardless of the acknowledgment mode selected, unacknowledged messages effectively remain on the queue and will be delivered at some later time to another consumer for processing.

Message Persistence

Message brokers can manage multiple queues on the same hardware. By default, message queues are typically memory based, in order to provide the fastest possible service to producers and consumers. Managing queues in memory has minimal overheads, as long as memory is plentiful. It does, however, risk message loss if the server were to crash.

To guard against message loss—a practice known as data safety—queues can be configured to be persistent. When a message is placed on a queue by a producer, the operation does not complete until the message is written to disk. This scheme is depicted in Figure 7-2. Now, if a message broker should fail, on reboot it can recover the queue contents to the state they existed in before the failure, and no messages will be lost. Many applications can’t afford to lose messages, and hence persistent queues are necessary to provide data safety and fault tolerance.

Figure 7-2. Persisting messages to disk

Persistent queues inherently increase the response time for send operations, with the trade-off being enhanced data safety. Brokers will usually maintain the queue contents in memory as well as on disk so messages can be delivered to consumers with minimal overhead during normal operations.

Publish–Subscribe

Message queues deliver each message to exactly one consumer. For many use cases, this is exactly what you want—my online purchase return needs to be consumed just once by the originating vendor—so that I get my money back.

Let’s extend this use case. Assume the online retailer wants to do an analysis of all purchase returns so it can detect vendors who have a high rate of returns and take some remedial action. To implement this, you could simply deliver all purchase return messages to the respective vendor and the new analysis service. This creates a one-to-many messaging requirement, which is known as a publish–subscribe architecture pattern. In publish–subscribe systems, message queues are known as topics. A topic is basically a message queue that delivers each published message to one or more subscribers, as illustrated in Figure 7-3.

Figure 7-3. Publish–subscribe broker architecture

With publish–subscribe, you can create highly flexible and dynamic systems. Publishers are decoupled from subscribers, and the number of subscribers can vary dynamically. This makes the architecture highly extensible as new subscribers can be added without any changes to the existing system. It also makes it possible to perform message processing by a number of consumers in parallel, thus enhancing performance.

Publish–subscribe places an additional performance burden on the message broker. The broker is obliged to deliver each message to all active subscribers. As subscribers will inevitably process and acknowledge messages at different times, the broker needs to keep messages available until all subscribers have consumed each message. Utilizing a push model for message consumption provides the most efficient solution for publish–subscribe architectures.

Publish–subscribe messaging is a key component for building distributed, event-driven architectures. In event-driven architectures, multiple services can publish events related to some state changes using message broker topics. Services can register interest in various event types by subscribing to a topic. Each event published on the topic is then delivered to all interested consumer services. I’ll return to event-driven architectures when microservices are covered in Chapter 9.2

Message Replication

In an asynchronous system, the message broker is potentially a single point of failure. A system or network failure can cause the broker to be unavailable, making it impossible for the system to operate normally. This is rarely a desirable situation.

For this reason, most message brokers enable logical queues and topics to be physically replicated across multiple brokers, each running on their own node. If one broker fails, then producers and consumers can continue to process messages using one of the replicas. This architecture is illustrated in Figure 7-4. Messages published to the leader are mirrored to the follower, and messages consumed from the leader are removed from the follower.

Figure 7-4. Message queue replication

The most common approach to message queue replication is known as a leader-follower architecture. One broker is designated as the leader, and producers and consumers send messages to and receive messages from this leader. In the background, the leader replicates (or mirrors) all messages it receives to the follower, and removes messages that are successfully delivered. This is shown in Figure 7-4 with the replicate and remove operations. How precisely this scheme behaves and the effects it has on broker performance are inherently implementation specific, and hence product dependent.

With leader-follower message replication, the follower is known as a hot standby, basically a replica of the leader that is available if the leader fails. In such a failure scenario, producers and consumers can continue to operate by switching over to accessing the follower. This is also called failover. Failover is implemented in the client libraries for the message broker, and hence occurs transparently to producers and consumers.

Implementing a broker that performs queue replication is a complicated affair. There are numerous subtle failure cases that the broker needs to handle when duplicating messages. I’ll start to raise these issues and describe some solutions in Chapters 10 and 11 when discussions turn to scalable data management.

Warning

Some advice: don’t contemplate rolling your own replication scheme, or any other complex distributed algorithm for that matter. The software world is littered with failed attempts to build application-specific distributed systems infrastructure, just because the solutions available “don’t do it quite right for our needs” or “cost too much.” Trust me—your solution will not work as well as existing solutions and development will cost more than you could ever anticipate. You will probably end up throwing your code away. These algorithms are really hard to implement correctly at scale.

Example: RabbitMQ

RabbitMQ is one of the most widely utilized message brokers in distributed systems. You’ll encounter deployments in all application domains, from finance to telecommunications to building environment control systems. It was first released around 2009 and has developed into a full-featured, open source distributed message broker platform with support for building clients in most mainstream languages.

The RabbitMQ broker is built in Erlang, and primarily provides support for the Advanced Message Queuing Protocol (AMQP) open standard.3 AMQP emerged from the finance industry as a cooperative protocol definition effort. It is a binary protocol, providing interoperability between different products that implement the protocol. Out of the box, RabbitMQ supports AMQP v0-9-1, with v1.0 support via a plugin.

Messages, Exchanges, and Queues

In RabbitMQ, producers and consumers use a client API to send and receive messages from the broker. The broker provides the store-and-forward functionality for messages, which are processed in a FIFO manner using queues. The broker implements a messaging model based on a concept called exchanges, which provide a flexible mechanism for creating messaging topologies.

An exchange is an abstraction that receives messages from producers and delivers them to queues in the broker. Producers only ever write messages to an exchange. Messages contain a message payload and various attributes known as message metadata. One element of this metadata is the routing key, which is a value used by the exchange to deliver messages to the intended queues.

Exchanges can be configured to deliver a message to one or more queues. The message delivery algorithm depends on the exchange type and rules called bindings, which establish a relationship between an exchange and a queue using the routing key. The three most commonly used exchange types are shown in Table 7-1.

Table 7-1. Exchange types
Exchange type   Message routing behavior
Direct   Delivers messages to queues based on a match with the routing key value published with each message
Topic   Delivers messages to one or more queues based on a match between the routing key and the pattern used to bind a queue to the exchange
Fanout   Delivers messages to all queues bound to the exchange; the routing key is ignored

Direct exchanges are typically used to deliver each message to one destination queue based on matching the routing key.4 Topic exchanges are a more flexible mechanism based on pattern matching that can be used to implement sophisticated publish–subscribe messaging topologies. Fanout exchanges provide a simple one-to-many broadcast mechanism, in which every message is sent to all attached queues.
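
To preview the client API used in the snippets below, a fanout broadcast can be set up in just a few lines with the RabbitMQ Java client. This is a minimal sketch assuming an established channel; the exchange name is hypothetical:

// Producer: declare a fanout exchange; the routing key is ignored,
// so an empty string is passed ("returns-analytics" is a hypothetical name)
channel.exchangeDeclare("returns-analytics", "fanout");
channel.basicPublish("returns-analytics", "", null, message.getBytes());

// Each consumer binds its own queue, and every bound queue
// receives a copy of every published message
String queueName = channel.queueDeclare().getQueue();
channel.queueBind(queueName, "returns-analytics", "");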

Figure 7-5 depicts how a direct exchange operates. Queues are bound to the exchange by consumers with three values, namely “France,” “Spain,” and “Portugal.” When a message arrives from a publisher, the exchange uses the attached routing key to deliver the message to one of the three attached queues.

Figure 7-5. An example of a RabbitMQ direct exchange

The following code shows an excerpt of how a direct exchange is configured and utilized in Java. RabbitMQ clients, namely producer and consumer processes, use a channel abstraction to establish communications with the broker (more on channels in the next section). The producer creates the exchange in the broker and publishes a message to the exchange with the routing key set to “France.” A consumer creates an anonymous queue in the broker, binds the queue to the exchange created by the publisher, and specifies that messages published with the routing key “France” should be delivered to this queue.

Producer:

channel.exchangeDeclare(EXCHANGE_NAME, "direct");
channel.basicPublish(EXCHANGE_NAME, "France", null, message.getBytes());

Consumer:

String queueName = channel.queueDeclare().getQueue();
channel.queueBind(queueName, EXCHANGE_NAME, "France");

Distribution and Concurrency

To get the most from RabbitMQ in terms of performance and scalability, you must understand how the platform works under the covers. The issues of concern relate to how clients and the broker communicate, and how threads are managed.

Each RabbitMQ client connects to a broker using a RabbitMQ connection. This is basically an abstraction on top of TCP/IP, and can be secured using user credentials or TLS. Creating connections is a heavyweight operation, requiring multiple round trips between the client and server, and hence a single long-lived connection per client is the common usage pattern.

To send or receive messages, clients use the connection to create a RabbitMQ channel. Channels are a logical connection between a client and the broker, and only exist in the context of a RabbitMQ connection, as shown in the following code snippet:

ConnectionFactory connFactory = new ConnectionFactory();
Connection rmqConn = connFactory.newConnection();
Channel channel = rmqConn.createChannel();

Multiple channels can be created in the same client to establish multiple logical broker connections. All communications over these channels are multiplexed over the same RabbitMQ (TCP) connection, as shown in Figure 7-6. Creating a channel requires a network round trip to the broker. Hence for performance reasons, channels should ideally be long-lived, with channel churn, namely constantly creating and destroying channels, avoided.

Figure 7-6. RabbitMQ connections and channels

To increase the throughput of RabbitMQ clients, a common strategy is to implement multithreaded producers and consumers. Channels, however, are not thread safe, meaning every thread requires exclusive access to a channel. This is not a concern if your client has long-lived, stateful threads and can create a channel per thread, as shown in Figure 7-6. You start a thread, create a channel, and publish or consume away. This is a channel-per-thread model.

In application servers such as Tomcat or Spring however, the solution is not so simple. The life cycle and invocation of threads is controlled by the server platform, not your code. The solution is to create a global channel pool upon server initialization. This precreated collection of channels can be used on demand by server threads without the overheads of channel creation and deletion per request. Each time a request arrives for processing, a server thread takes the following steps:

  • Retrieves a channel from the pool

  • Sends the message to the broker

  • Returns the channel to the pool for subsequent reuse

While there is no native RabbitMQ capability to do this, in Java you can utilize the Apache Commons Pool library to implement a channel pool. The complete code for this implementation is included in the accompanying code repository for this book. The following code snippet shows how a server thread uses the borrowObject() and returnObject() methods of the Apache GenericObjectPool class. You can tune the minimum and maximum size of this object pool using setter methods to provide the throughput your application desires:

private boolean sendMessageToQueue(JsonObject message) {
  try {
    Channel channel = pool.borrowObject();
    channel.basicPublish( /* arguments omitted for brevity */ );
    pool.returnObject(channel);
    return true;
  } catch (Exception e) {
    logger.info("Failed to send message to RabbitMQ");
    return false;
  }
}
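
The pool itself might be constructed at server initialization along the following lines. This is a minimal sketch assuming Apache Commons Pool 2 and the shared, long-lived connection from the earlier snippet; the sizing values are illustrative only:

import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;

public class ChannelPoolFactory extends BasePooledObjectFactory<Channel> {
  private final Connection connection;  // shared, long-lived connection

  public ChannelPoolFactory(Connection connection) {
    this.connection = connection;
  }

  @Override
  public Channel create() throws Exception {
    // Each pooled object is a channel on the shared connection
    return connection.createChannel();
  }

  @Override
  public PooledObject<Channel> wrap(Channel channel) {
    return new DefaultPooledObject<>(channel);
  }
}

// At server initialization (sizing values are illustrative):
GenericObjectPool<Channel> pool =
    new GenericObjectPool<>(new ChannelPoolFactory(rmqConn));
pool.setMinIdle(4);
pool.setMaxTotal(32);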

On the consumer side, clients create channels that can be used to receive messages. Consumers can explicitly retrieve messages on demand from a queue using the basicGet() API, as shown in the following example:

boolean autoAck = true;
GetResponse response = channel.basicGet(queueName, autoAck);
if (response == null) {
    // No message available. Decide what to do …
} else {
    // process message
}

This approach uses the pull model (polling). Polling is inefficient as it involves busy-waiting, obliging the consumer to continually ask for messages even if none are available. In high-performance systems, this is not the approach to use.

The alternative and preferable method is the push model. The consumer specifies a callback function that is invoked for each message the RabbitMQ broker sends, or pushes, to the consumer. Consumers issue a call to the basicConsume() API. When a message is available for the consumer from the queue, the RabbitMQ client library on the consumer invokes the callback in another thread associated with the channel. The following code example shows how to receive messages using an object of type DefaultConsumer that is passed to basicConsume() to register the consumer:

boolean autoAck = true;
channel.basicConsume(queueName, autoAck, "tag",
     new DefaultConsumer(channel) {
         @Override
         public void handleDelivery(String consumerTag,
                                    Envelope envelope,
                                    AMQP.BasicProperties properties,
                                    byte[] body)
             throws IOException
         {
             // process the message
         }
     });

Reception of messages on a single channel is single threaded. This makes it necessary to create multiple threads and allocate a channel-per-thread or channel pool in order to obtain high message consumption rates. The following Java code extract shows how this can be done. Each thread creates and configures its own channel and specifies the callback function—threadCallback()—that should be called by the RabbitMQ client when a new message is delivered:

Runnable runnable = () -> {
  try {
    final Channel channel = connection.createChannel();
    channel.queueDeclare(QUEUE_NAME, true, false, false, null);
    // max one message per receiver

    final DeliverCallback threadCallback = (consumerTag, delivery) -> {
      String message =
          new String(delivery.getBody(), StandardCharsets.UTF_8);
      // process the message
    };
    channel.basicConsume(QUEUE_NAME,
                         false, threadCallback, consumerTag -> {});
  } catch (IOException e) {
    logger.info(e.getMessage());
  }
};

Another important aspect of RabbitMQ to appreciate in order to obtain high performance and scalability is the thread model used by the message broker. In the broker, each queue is managed by a single thread. This means you can increase throughput on a multicore node if you have at least as many queues as cores on the underlying node. Conversely, if you have many more highly utilized queues than cores on your broker node, you are likely to see some performance degradation.

Like most message brokers, RabbitMQ performs best when consumption rates keep up with production rates. When queues grow long, in the order of tens of thousands of messages, the thread managing a queue will experience more overheads. By default, the broker will utilize 40% of the available memory of the node it is running on. When this limit is reached, the broker will start to throttle producers, slowing down the rate at which the broker accepts messages, until the memory usage drops below the 40% threshold. The memory threshold is configurable and again, this is a setting that can be tuned to your workload to optimize message throughput.5
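
For example, assuming the modern rabbitmq.conf format, the watermark might be raised as follows (the 0.6 value is purely illustrative):

# rabbitmq.conf: raise the memory high watermark from the default
# 40% of available memory to 60% (illustrative value only)
vm_memory_high_watermark.relative = 0.6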

Data Safety and Performance Trade-offs

All messaging systems present a dilemma around a performance versus reliability trade-off. In this particular case, the core issue is the reliability of message delivery, commonly known as data safety. You want your messages to transit between producer and consumer with minimum latency, and of course you don’t want to lose any messages along the way. Ever. If only it were that simple. These are distributed systems, remember.

When a message transits from producer to consumer, there are multiple failure scenarios you have to understand and cater for in your design. These are:

  • A producer sends a message to a broker and the message is not successfully accepted by the broker.

  • A message is in a queue and the broker crashes.

  • A message is successfully delivered to the consumer but the consumer fails before fully processing the message.

If your application can tolerate message loss, then you can choose options that maximize performance. It probably doesn’t matter if occasionally you lose a message from an instant messaging application. In this case your system can ignore message safety issues and run full throttle. This isn’t the case for, say, a purchasing system. If purchase orders are lost, the business loses money and customers. You need to put safeguards in place to ensure data safety.

RabbitMQ, like basically all message brokers, has features that you can utilize to guarantee end-to-end message delivery. These are:

Publisher-confirms
A publisher can specify that it wishes to receive acknowledgments from the broker that a message has been successfully received. This is not default publisher behavior and must be set as a channel attribute by calling the confirmSelect() method. Publishers can wait for acknowledgments synchronously, or asynchronously by registering a callback function.
Persistent messages and message queues
If a message broker fails, all messages stored in memory for each queue are lost. To survive a broker crash, queues need to be configured as persistent (durable). This means messages are written to disk as soon as they arrive from publishers. When a broker is restarted after a crash, it recovers all persistent queues and messages. In RabbitMQ, both queues and individual messages need to be configured as persistent to provide a high level of data safety.
Consumer manual acknowledgments
A broker needs to know when it can consider a message successfully delivered to a consumer so it can remove the message from the queue. In RabbitMQ, this occurs either immediately after a message is written to a TCP socket, or when the broker receives an explicit client acknowledgment. These two modes are known as automatic and manual acknowledgments, respectively. Automatic acknowledgments risk data safety as a connection or a consumer may fail before the consumer processes the message. For data safety, it is therefore important to utilize manual acknowledgments to make sure a message has been both received and processed before it is evicted from the queue.

In a nutshell, you need publisher acknowledgments, persistent queues and messages, and manual consumer acknowledgments for complete data safety. Your system will almost certainly take a performance hit, but you won’t lose messages.
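
To make this concrete, the following sketch shows what the three safeguards might look like with the RabbitMQ Java client used in this chapter. The "orders" queue name is hypothetical, and error handling is omitted:

// Publisher: enable publisher confirms on the channel
channel.confirmSelect();

// Declare a durable queue and publish a persistent message to it
channel.queueDeclare("orders", true, false, false, null);
channel.basicPublish("", "orders",
    MessageProperties.PERSISTENT_TEXT_PLAIN, payload.getBytes());

// Block until the broker confirms receipt (5-second timeout)
channel.waitForConfirmsOrDie(5000);

// Consumer: use manual acknowledgments, acknowledging only after
// the message has been fully processed
boolean autoAck = false;
channel.basicConsume("orders", autoAck, (consumerTag, delivery) -> {
    // process the message ...
    channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
}, consumerTag -> {});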

Availability and Performance Trade-Offs

Another classic messaging system trade-off is between availability and performance. A single broker is a single point of failure, and hence the system will be unavailable if the broker crashes or experiences a transient network failure. The solution, as is typical for increasing availability, is broker and queue replication.

RabbitMQ provides two ways to support high availability, known as mirrored queues and quorum queues. While the details in implementation differ, the basics are the same, namely:

  • Two or more RabbitMQ brokers need to be deployed and configured as a cluster.

  • Each queue has a leader version, and one or more followers.

  • Publishers send messages to the leader, and the leader takes responsibility for replicating each message to the followers.

  • Consumers also connect to the leader, and when messages are successfully acknowledged at the leader, they are also removed from followers.

  • As all publisher and consumer activity is processed by the leader, both quorum and mirrored queues enhance availability but do not support load balancing. Message throughput is limited by the performance of the leader replica.

There are numerous differences in the exact features supported by quorum and mirrored queues. The key difference, however, revolves around how messages are replicated and how a new leader is selected in case of leader failure. Quorum in this context essentially means a majority. If there are five queue replicas, then at least three replicas—the leader and two followers—need to persist a newly published message. Quorum queues implement an algorithm known as RAFT to manage replication and elect a new leader when the existing leader becomes unavailable. I’ll discuss RAFT in some detail in Chapter 12.

Quorum queues must be persistent and are therefore designed to be utilized in use cases when data safety and availability take priority over performance. They have other advantages over the mirrored queue implementation in terms of failure handling. For these reasons, the mirrored queue implementation will be deprecated in future versions.

Messaging Patterns

With a long history of usage in enterprise systems, a comprehensive catalog of design patterns exists for applications that utilize messaging. While many of these are concerned with best design practices for ease of construction and modification of systems and message security, a number apply directly to scalability in distributed systems. I’ll explain three of the most commonly utilized patterns in the next sections.

Competing Consumers

A common requirement for messaging systems is to consume messages from a queue as quickly as possible. With the competing consumers pattern, this is achieved by running multiple consumer threads and/or processes that concurrently process messages. This enables an application to scale out message processing by horizontally scaling the consumers as needed. The general design is shown in Figure 7-7.

Figure 7-7. The competing consumers pattern

Using this pattern, messages can be distributed across consumers dynamically using either the push or a pull model. Using the push approach, the broker is responsible for choosing a consumer to deliver a message to. A common method, which, for example, is implemented in RabbitMQ and ActiveMQ, is a simple round-robin distribution algorithm. This ensures an even distribution of messages to consumers.
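
In RabbitMQ, for example, fairer dispatch across competing consumers can be achieved by capping the number of unacknowledged messages each consumer may hold; a minimal sketch:

// Each competing consumer receives at most one unacknowledged message
// at a time, so faster consumers naturally take a larger share of the load
channel.basicQos(1);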

With the pull approach, consumers simply consume messages as quickly as they can process them. Assuming a multithreaded consumer, if one consumer is running on an 8-core node and another on a 2-core node, we’d expect the former would process approximately four times the amount of messages of the latter. Hence, load balancing occurs naturally with the pull approach.

There are three key advantages to this pattern, namely:

Availability
If one consumer fails, the system remains available, and its share of messages is simply distributed to the other competing consumers.
Failure handling
If a consumer fails, unacknowledged messages are delivered to another queue consumer.
Dynamic load balancing
New consumers can be started under periods of high load and stopped when load is reduced, without the need to change any queue or consumer configurations.

Support for competing consumers will be found in any production-quality messaging platform. It is a powerful way to scale out message processing from a single queue.

Exactly-Once Processing

As I discussed in Chapter 3, transient network failures and delayed responses can cause a client to resend a message. This can potentially lead to duplicate messages being received by a server. To alleviate this issue, we need to put in place measures to ensure idempotent processing.

In asynchronous messaging systems, there are two sources for duplicate messages being processed. The first is duplicates from the publisher, and the second is consumers processing a message more than once. Both need to be addressed to ensure exactly-once processing of every message.

The publisher part of the problem originates from a publisher retrying a message when it does not receive an acknowledgment from the message broker. If the original message was received and the acknowledgment lost or delayed, this may lead to duplicates on the queue. Fortunately, some message brokers provide support for duplicate detection, and thus ensure duplicates do not get published to a queue. For example, the ActiveMQ Artemis release can remove duplicates that are sent from the publisher to the broker. The approach is based on the solution I described in Chapter 3, using client-generated, unique idempotency key values for each message. Publishers simply need to set a specific message property to a unique value, as shown in the following code:

ClientMessage msg = session.createMessage(true);
UUID idKey = UUID.randomUUID();  // use as idempotence key
msg.setStringProperty(HDR_DUPLICATE_DETECTION_ID, idKey.toString());

The broker utilizes a cache to store idempotency key values and detect duplicates. This effectively eliminates duplicate messages from the queue, solving the first part of your problem.

On the consumer side, duplicates occur when the broker delivers a message to a consumer, which processes it and then fails to send an acknowledgment (consumer crashes or the network loses the acknowledgment). The broker therefore redelivers the message, potentially to a different consumer if the application utilizes the competing consumer pattern.

It’s the obligation of consumers to guard against duplicate processing. Again, the mechanisms I described in Chapter 3 apply, namely maintaining a cache or database of idempotency keys for messages that have already been processed. Most brokers will set a message header that indicates whether a message is a redelivery. This can be used in the consumer’s implementation of idempotence. It doesn’t guarantee a consumer has seen the message already; it just tells you that the broker delivered it and the message remains unacknowledged.
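
The following sketch illustrates one way to implement such a defensive consumer. It assumes producers stamp each message with a hypothetical "idempotencyKey" header, and that processedKeys is some durable set of already handled keys:

DeliverCallback idempotentCallback = (consumerTag, delivery) -> {
    // "idempotencyKey" is a hypothetical application-defined header;
    // delivery.getEnvelope().isRedeliver() hints this may be a duplicate
    String key = delivery.getProperties().getHeaders()
                         .get("idempotencyKey").toString();
    if (!processedKeys.contains(key)) {
        // process the message ...
        processedKeys.add(key);  // record the key before acknowledging
    }
    channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
};
channel.basicConsume(QUEUE_NAME, false, idempotentCallback, consumerTag -> {});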

Poison Messages

Sometimes messages delivered to consumers can’t be processed. There are numerous possible reasons for this. Probably most common are errors in producers that send messages that cannot be handled by consumers. This could be for reasons such as a malformed JSON payload or some unanticipated state change, for example, a StudentID field in a message for a student who has just dropped out from the institution and is no longer active in the database. Regardless of the reason, these poison messages have one of two effects:

  • They cause the consumer to crash. This is probably most common in systems under development and test. Sometimes, though, these issues sneak into production, where failing consumers are sure to cause some serious operational headaches.

  • They cause the consumer to reject the message as it is not able to successfully process the payload.

In either case, assuming consumer acknowledgments are required, the message remains on the queue in an unacknowledged state. After some broker-specific mechanism, typically a timeout or a negative acknowledgment, the poison message will be delivered to another consumer for processing, with predictable, undesirable results.

If poison messages are not somehow detected, they can be delivered indefinitely. This at best takes up processing capacity and hence reduces system throughput. At worst it can bring a system to its knees by crashing consumers every time a poison message is received.

The solution to poison message handling is to limit the number of times a message can be redelivered. When the redelivery limit is reached, the message is automatically moved to a queue where problematic requests are collected. This queue is traditionally and rather macabrely known as the dead-letter queue.
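
In RabbitMQ, for example, dead lettering can be configured via optional queue arguments; messages rejected with requeue disabled are then rerouted instead of being redelivered forever (quorum queues additionally support an x-delivery-limit argument to cap redeliveries). A minimal sketch with hypothetical exchange and queue names:

// Declare the dead-letter exchange and the queue that collects
// poison messages ("dlx" and "poison-messages" are hypothetical names)
channel.exchangeDeclare("dlx", "direct");
channel.queueDeclare("poison-messages", true, false, false, null);
channel.queueBind("poison-messages", "dlx", "poison");

// Any message rejected from this application queue (basicReject or
// basicNack with requeue=false) is rerouted to the dead-letter exchange
Map<String, Object> args = new HashMap<>();
args.put("x-dead-letter-exchange", "dlx");
args.put("x-dead-letter-routing-key", "poison");
channel.queueDeclare("orders", true, false, false, args);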

As you no doubt expect by now, the exact mechanism for implementing poison message handling varies across messaging platforms. For example, Amazon Simple Queue Service (SQS) defines a policy that specifies the dead-letter queue that is associated with an application-defined queue. The policy also specifies how many redeliveries are allowed before a message is automatically moved from the application queue to the dead-letter queue. This value is known as the maxReceiveCount.

In SQS, each message has a ReceiveCount attribute, which is incremented when a message is not successfully processed by a consumer. When the ReceiveCount exceeds the defined maxReceiveCount value for a queue, SQS moves the message to the dead-letter queue. Sensible values for redelivery vary with application characteristics, but a range of three to five is common.

The final part of poison message handling is diagnosing the cause for messages being redirected to the dead-letter queue. First, you need to set some form of monitoring alert that sends a notification to engineers that a message has failed processing. At that stage, diagnosis will comprise examining logs for exceptions that caused processing to fail and analyzing the message contents to identify producer or consumer issues.

Summary and Further Reading

Asynchronous messaging is an integral component of scalable system architectures. Messaging is particularly attractive in systems that experience peaks and troughs in request load. During peak times, producers can add requests to queues and respond rapidly to clients, without having to wait for the requests to be processed.

Messaging decouples producers from consumers, making it possible to scale them independently. Architectures can take advantage of this by elastically scaling producers and consumers to match traffic patterns and balance message throughput requirements with costs. Message queues can be distributed across multiple brokers to scale message throughput. Queues can also be replicated to enhance availability.

Messaging is not without its dangers. Duplicates can be placed on queues, and messages can be lost if queues are maintained in memory. Deliveries to consumers can be lost, and a message can be consumed more than once if acknowledgments are lost. These data safety issues require attention to detail in design so that tolerance for duplicate messages and message loss is matched to the system requirements.

If you are interested in acquiring a broad and deep knowledge of messaging architectures and systems, the classic book Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf (Addison-Wesley Professional, 2003) should be your first stop. Other excellent sources of knowledge tend to be messaging platform specific, and as there are a lot of competing platforms, there’s a lot of books to choose from. My favorite RabbitMQ books for general messaging wisdom and RabbitMQ-specific information are RabbitMQ Essentials, 2nd ed., by David Dossot and Lovisa Johansson (Packt, 2014) and RabbitMQ in Depth by Gavin M. Roy (Manning, 2017).

On a final note, the theme of asynchronous communications and the attendant advantages and problems will permeate the remainder of this book. Messaging is a key component of microservice-based architectures (Chapter 9) and is foundational to how distributed databases function. And you’ll certainly recognize the topics of this chapter when I cover streaming systems and event-driven processing in Part IV.

1 A helpful overview of the messaging technologies landscape can be found at https://oreil.ly/KMvTp.

2 Chapter 14 of Fundamentals of Software Architecture by Mark Richards and Neal Ford is an excellent source of knowledge for event-driven architectures.

3 Other protocols such as STOMP and MQTT are supported via plugins.

4 Consumers can call queueBind() multiple times to specify that their destination should receive messages for more than one routing key value. This approach can be used to create one-to-many message distribution. Topic exchanges are more powerful for one-to-many messaging.

5 A complete description of how the RabbitMQ server memory can be configured is available at the RabbitMQ Memory Alarms page.

Chapter 8. Serverless Processing Systems

Scalable systems experience widely varying patterns of usage. For some applications, load may be high during business hours and low or nonexistent during nonbusiness hours. Other applications, for example, an online concert ticket sales system, might have low background traffic 99% of the time. But when tickets for a major series of shows are released, the demand can spike by 10,000 times the average load for a number of hours before dropping back down to normal levels.

Elastic load balancing, as described in Chapter 5, is one approach for handling these spikes. Another is serverless computing, which I’ll examine in this chapter.

The Attractions of Serverless

The transition of major organizational IT systems from on-premises to public cloud platforms deployments seems inexorable. Organizations from startups to government agencies to multinationals see clouds as digital transformation platforms and a foundational technology to improve business continuity.

Two of the great attractions of cloud platforms are their pay-as-you-go billing and ability to rapidly scale up (and down) virtual resources to meet fluctuating workloads and data volumes. This ability to scale, of course, doesn’t come for free. Your applications need to be architected to leverage the scalable services provided by cloud platforms. And of course, as I discussed in Chapter 1, cost and scale are indelibly connected. The more resources a system utilizes for extended periods, the larger your cloud bills will be at the end of the month.

Monthly cloud bills can be big. Really big. Even worse, unexpectedly big! Cases of “sticker shock” for significant cloud overspend are rife—in one survey, 69% of respondents regularly overspent on their cloud budget by more than 25%. In one well-known case, $500K was spent on an Azure task before it was noticed. Reasons attributed to overspending are many, including lack of deployment of autoscaling solutions, poor long-term capacity planning, and inadequate exploitation of cloud architectures leading to bloated system footprints.

On a cloud platform, architects are confronted with a myriad of architectural decisions. These decisions are both broad, in terms of the overall architectural patterns or styles the systems adopts—for example, microservices, N-tier, event driven—and narrow, specific to individual components and the cloud services that the system is built upon.

In this sense, architecturally significant decisions pervade all aspects of the system design and deployment on the cloud. And the collective consequences of these decisions are highly apparent when you receive your monthly cloud spending bill.

Traditionally, cloud applications have been deployed on an infrastructure as a service (IaaS) platform utilizing virtual machines (VMs). In this case, you pay for the resources you deploy regardless of how highly utilized they are. If load increases, elastic applications can spin up new virtual machines to increase capacity, typically using the cloud-provided load balancing service. Your costs are essentially proportional to the type of VMs you choose, the duration they are deployed for, and the amount of data the application stores and transmits.

Major cloud providers offer an alternative to explicitly provisioning virtual processing resources. Known as serverless platforms, they do not require any compute resources to be statically provisioned. Using technologies such as AWS Lambda or Google App Engine (GAE), the application code is loaded and executed on demand, when requests arrive. If there are no active requests, there are essentially no resources in use and no charges to meet.

Serverless platforms also manage autoscaling (up and down) for you. As simultaneous requests arrive, additional processing capacity is created to handle requests and, ideally, provide consistently low response times. When request loads drop, additional processing capacity is decommissioned, and no charges are incurred.

Every serverless platform varies in the details of its implementation. For example, a limited number of mainstream programming languages and application server frameworks are typically supported. Platforms provide multiple configuration settings that can be used to balance performance, scalability and costs. In general, costs are proportional to the following factors:

  • The type of processing instance chosen to execute a request

  • The number of requests and processing duration for each request

  • How long each application server instance remains resident on the serverless infrastructure

However, the exact parameters used vary considerably across vendors. Every platform is proprietary and different in subtle ways. The devil lurks, as usual, in the details. So, let’s explore some of those devilish details specifically for the GAE and AWS Lambda platforms.

Google App Engine

Google App Engine (GAE) was the first offering from Google as part of what is now known as the Google Cloud Platform (GCP). It has been in general release since 2011 and enables developers to upload and execute HTTP-based application services on Google’s managed cloud infrastructure.

The Basics

GAE supports developing applications in Go, Java, Python, Node.js, PHP, .NET, and Ruby. To build an application on GAE, developers can utilize common HTTP-based application frameworks that are built with the GAE runtime libraries provided by Google. For example, in Python, applications can utilize Flask, Django, and web2py, and in Java the primary supported platform is servlets built on the Jetty JEE web container.
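
To give a feel for the programming model, a Java service on the standard environment is essentially a stateless servlet. The following minimal sketch (class name and URL path are illustrative) serves an HTTP GET request:

import java.io.IOException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// GAE routes requests for /hello to this stateless servlet; instances
// are created and destroyed by the platform as load varies
@WebServlet("/hello")
public class HelloServlet extends HttpServlet {
  @Override
  public void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    resp.setContentType("text/plain");
    resp.getWriter().println("Hello from App Engine!");
  }
}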

Application execution is managed dynamically by GAE, which launches compute resources to match request demand levels. Applications generally access a managed persistent storage platform such as Google’s Firestore or Google Cloud SQL, or interact with a messaging service like Google’s Cloud Pub/Sub.

GAE comes in two flavors, known as the standard environment and the flexible environment. The basic difference is that the standard environment is more closely managed by GAE, with development restrictions in terms of language versions supported. This tight management makes it possible to scale services rapidly in response to increased loads. In contrast, the flexible environment is essentially a tailored version of Google Compute Engine (GCE), which runs applications in Docker containers on VMs. As its name suggests, it gives more options in terms of development capabilities that can be used, but is not as suitable for rapid scaling.

In the rest of this chapter, I’ll focus on the highly scalable standard environment.

GAE Standard Environment

In the standard environment, developers upload their application code to a GAE project that is associated with a base project URL. This code must define HTTP endpoints that can be invoked by clients making requests to the URL. When a request is received, GAE will route it to a processing instance to execute the application code. These are known as resident instances for the application and are the major component of the cost incurred for utilizing GAE.

Each project configuration can specify a collection of parameters that control when GAE loads a new instance or invokes a resident instance. The two simplest settings control the minimum and maximum instances that GAE will have resident at any instant. The minimum can be zero, which is perfect for applications that have long periods of inactivity, as this incurs no costs.

When a request arrives and there are no resident instances, GAE dynamically loads an application instance and invokes the processing for the endpoint. Multiple simultaneous requests can be sent to the same instance, up to some configured limit (more on this when I discuss autoscaling later in this chapter). GAE will then load additional instances on demand until the specified maximum instance value is reached. By setting the maximum, an application can put a lid on costs, albeit with the potential for increased latencies if load continues to grow.

As mentioned previously, standard environment applications can be built in Go, Java, Python, Node.js, PHP, and Ruby. As GAE itself is responsible for loading the runtime environment for an application, it restricts the supported versions to a small number per programming language. The language used also affects the time to load a new instance on GAE. For example, a lightweight runtime environment such as Go will start on a new instance in less than a second. In comparison, a bulkier JVM takes on the order of 1–3 seconds on average. This load time is also influenced by the number of external libraries that the application incorporates.

Hence, while there is variability across languages, loading new instances is relatively fast. Much faster than booting a virtual machine, anyway. This makes the standard environment extremely well suited for applications that experience rapid spikes in load. GAE is able to quickly add new resident instances as request volumes increase. Requests are dynamically routed to instances based on load, and hence assume a purely stateless application model to support effective load distribution. Subsequently, instances are released with little delay once the load drops, again reducing costs.

GAE’s standard environment is an extremely powerful platform for scalable applications, and one I’ll explore in more detail in the case study later in this chapter.

Autoscaling

Autoscaling is an option that you specify in an app.yaml file that is passed to GAE when you upload your server code. An autoscaled application is managed by GAE according to a collection of default parameter values, which you can override in your app.yaml. The basic scheme is shown in Figure 8-1.

Figure 8-1. GAE autoscaling

GAE basically manages the number of deployed processing instances for an application based on incoming traffic load. If there are no incoming requests, then GAE will not schedule any instances. When a request arrives, GAE deploys an instance to process the request.

Deploying an instance can take anywhere from a few hundred milliseconds to a few seconds, depending on the programming language you are using. This means latency can be high for initial requests if there are no resident instances. To mitigate the effects of this instance loading latency, you can specify a minimum number of instances to keep available for processing requests. This, of course, costs money.

As the request load grows, the GAE scheduler will dynamically load more instances to handle requests. Three parameters control precisely how scaling operates, namely:

Target CPU utilization
Sets the CPU utilization threshold above which more instances will be started to handle traffic. The range is 0.5 (50%) to 0.95 (95%). The default is 0.6 (60%).
Maximum concurrent requests
Sets the maximum number of concurrent requests an instance can accept before the scheduler spawns a new instance. The default value is 10, and the maximum is 80. The documentation doesn’t state the minimum allowed value, but presumably 1 would define a single-threaded service.
Target throughput utilization
This is used in conjunction with the value specified for maximum concurrent requests to specify when a new instance is started. The range is 0.5 (50%) to 0.95 (95%). The default is 0.6 (60%). It works like this: when the number of concurrent requests for an instance reaches a value equal to the maximum concurrent requests value multiplied by the target throughput utilization, the scheduler tries to start a new instance.

Got that? As is hopefully apparent, these three settings interact with each other, making configuration somewhat complex. By default, an instance will handle 10 × 0.6 = 6 concurrent requests before a new instance is created. And if these 6 (or fewer) requests cause the CPU utilization for an instance to go over 60%, the scheduler will also try to create a new instance.
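
To make the interplay concrete, here is a minimal sketch of the scale-out decision in Python. This is an illustration of how the parameter values combine, not GAE’s actual scheduler code, and the function and argument names are mine:

def should_start_new_instance(concurrent_requests, cpu_utilization,
                              max_concurrent_requests=10,
                              target_throughput_utilization=0.6,
                              target_cpu_utilization=0.6):
    # Trigger 1: the instance is handling "enough" concurrent requests, where
    # "enough" is max_concurrent_requests x target_throughput_utilization.
    request_threshold = max_concurrent_requests * target_throughput_utilization
    # Trigger 2: the instance's CPU utilization exceeds its target.
    return (concurrent_requests >= request_threshold or
            cpu_utilization > target_cpu_utilization)

# With the defaults, 10 x 0.6 = 6 concurrent requests triggers a new instance.
print(should_start_new_instance(concurrent_requests=6, cpu_utilization=0.4))  # True
print(should_start_new_instance(concurrent_requests=3, cpu_utilization=0.7))  # True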

But wait, there’s more!

You can also specify values to control when GAE adds new instances based on the time requests spend in the request pending queue (see Figure 8-1) waiting to be dispatched to an instance for processing. The max-pending-latency parameter specifies the maximum amount of time that GAE should allow a request to wait in the pending queue before starting additional instances to handle requests and reduce latency. The default value is 30 ms. The lower the value, the quicker an application will scale. And the more it will probably cost you.1
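
For reference, the sketch below shows how these settings come together in an app.yaml file. The keys are the documented GAE standard environment autoscaling settings, but the runtime and the specific values here are illustrative assumptions, not recommendations:

runtime: python39  # assumed runtime; any supported standard environment runtime works

automatic_scaling:
  min_instances: 0             # scale to zero when idle; no resident instance costs
  max_instances: 20            # cap resident instances to put a lid on costs
  target_cpu_utilization: 0.6
  target_throughput_utilization: 0.6
  max_concurrent_requests: 10
  max_pending_latency: 30ms    # start new instances sooner for lower latency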

These autoscaling parameter settings give us the ability to fine-tune a service’s behavior to balance performance and cost. How modifying these parameters will affect an application’s behavior is, of course, dependent on the precise functionality of the service. The fact that there are subtle interplays between these parameters makes this tuning exercise somewhat complicated, however. I’ll return to this topic in the case study section later in this chapter, and explain a simple, platform-agnostic approach you can take to service tuning.

AWS Lambda

AWS Lambda is Amazon’s serverless platform. The underlying design principles and major features echo that of GAE and other serverless platforms. Developers upload code which is deployed as services known as Lambda functions. When invoked, Lambda supplies a language-specific execution environment to run the function code.

A simple example of a Python Lambda function is shown in the following code. This function simply extracts a message from the input event and returns it unaltered as part of an HTTP 200 response. In general, you implement a function that takes an event and a context parameter. The event is a JSON-formatted document encapsulating data for a Lambda function to process. For example, if the Lambda function handles HTTP requests, the event will contain HTTP headers and the request body. The context contains metadata about the function and runtime environment, such as the function version number and available memory in the execution environment:

import json

def lambda_handler(event, context):
    # The request body arrives as a JSON string; parse it into a dict.
    event_body = json.loads(event['body'])
    # Echo the message back to the caller as the body of an HTTP 200 response.
    response = {
        'statusCode': 200,
        'body': json.dumps({'message': event_body['message']})
    }

    return response
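
As a quick illustration of the event model, the handler above can be exercised with a hand-built event, as sketched below. The event shape is a minimal assumption mimicking an HTTP trigger; a real invocation supplies a much richer event and a context object:

# Hypothetical local test: the 'body' field carries a JSON-encoded payload.
sample_event = {'body': json.dumps({'message': 'hello, serverless'})}
print(lambda_handler(sample_event, None))
# {'statusCode': 200, 'body': '{"message": "hello, serverless"}'}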

Lambda functions can be invoked by external clients over HTTP. They can also be tightly integrated with other AWS services. For example, this enables Lambda functions to be dynamically triggered when new data is written to the AWS S3 storage service or a monitoring event is sent to the AWS CloudWatch service. If your application is deeply embedded in the AWS ecosystem, Lambda functions can be of great utility in designing and deploying your architecture.

Given the core similarities between serverless platforms, in this section I’ll just focus on the differentiating features of Lambda from a scalability and cost perspective.

Lambda Function Life Cycle

Lambda functions can be built in a number of languages and support common service containers such as Spring for Java and Flask for Python. For each supported language, namely Node.js, Python, Ruby, Java, Go, and .NET-based code, Lambda supports a number of runtime versions. The runtime environment version is specified at deployment time along with the code, which is uploaded to Lambda in a compressed format.2

Lambda functions must be designed to be stateless so that the Lambda runtime environment can scale the service on demand. When a request first arrives for the API defined by the Lambda function, Lambda downloads the code for the function, initializes a runtime environment, performs any instance-specific initialization (e.g., creating a database connection), and finally invokes the function code handler.

This initial invocation is known as a cold start, and the time taken is dependent on the language environment selected, the size of the function code, and time taken to initialize the function. Like in GAE, lightweight languages such as Node.js and Go will typically take a few hundred milliseconds to initialize, whereas Java or .NET are heavier weight and can take a second or more.

Once an API execution is completed, Lambda can use the deployed function runtime environment for subsequent requests. This means cold start costs are not incurred. However, if a burst of requests arrive simultaneously, multiple runtime instances will be initialized, one for each request. Unlike GAE, Lambda does not send multiple concurrent requests to the same runtime instance. This means all these simultaneous requests will incur additional response times due to cold start costs.

If a new request does not arrive and a resident runtime instance is not immediately reutilized, Lambda freezes the execution environment. If subsequent requests arrive, the environment is thawed and reused. If more requests do not arrive for the function, after a platform-controlled number of minutes Lambda will deactivate a frozen instance so it does not continue to consume platform resources.3

Cold start costs can be mitigated by using provisioned concurrency. This tells Lambda to keep a minimum number of runtime instances resident and ready to process requests with no cold start overheads. The “no free lunch” principle applies of course, and charges increase based on the number of provisioned instances. You can also make a Lambda function a target of an AWS Application Load Balancer (ALB), in a similar fashion to that discussed in Chapter 5. For example, a load balancer policy that increases the provisioned concurrency for a function at a specified time, in anticipation of an increase in traffic, can be defined.

Execution Considerations

When you define a Lambda function, you specify the amount of memory that should be allocated to its runtime environment. Unlike GAE, you do not specify the number of vCPUs to utilize. Rather, the computation power is allocated in proportion to the memory specified, which is between 128 MB and 10 GB.

Lambda functions are charged for each millisecond of execution. The cost per millisecond grows with the amount of memory allocated to the runtime environment. For example, at the time of writing the costs per millisecond for a 2 GB instance are twice that of a 1 GB instance. Lambda does not specify precisely how much more compute capacity this additional memory buys your function, however. Still, the larger the amount of memory allocated, then the faster your Lambda functions will likely execute.4

This situation creates a subtle trade-off between performance and costs. Let’s examine a simple example based on the costs for 1 GB and 2 GB instances mentioned above, and assume that 1 millisecond of execution on a 1 GB instance incurs 1 mythical cost unit, and a millisecond on a 2 GB instance incurs 2 mythical cost units.

With 1 GB of memory, I’ll assume this function executes in 40 milliseconds, thus incurring 40 cost units. With 2 GB of memory allocated, and commensurately more CPU allocation, the same function takes 10 milliseconds, meaning you part with 20 cost units from your AWS wallet. Hence your bills will be reduced by 50% and you will get 4x faster execution by allocating more memory to the function. Tuning can surely pay dividends.
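
The arithmetic here is easy to capture as a tiny cost model. The sketch below uses the mythical cost units from the example, not AWS’s actual pricing:

def invocation_cost(duration_ms, memory_gb):
    # Cost per millisecond scales linearly with allocated memory:
    # 1 mythical cost unit per millisecond per GB, as in the example above.
    return duration_ms * memory_gb

print(invocation_cost(duration_ms=40, memory_gb=1))  # 40 cost units
print(invocation_cost(duration_ms=10, memory_gb=2))  # 20 cost units: 4x faster, half the cost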

This is obviously very dependent on the actual processing your Lambda function performs. Still, if your service is executed several billion times a month, this kind of somewhat nonintuitive tuning exercise may result in significant cost savings and greater scalability.

Finding this sweet spot that provides faster response times at similar or lower costs is a performance tuning experiment that can pay high dividends at scale. Lambda makes this a relatively straightforward experiment to perform as there is only one parameter (memory allocation) to vary. The case study later in this chapter will explain an approach that can be used for platforms such as GAE, which have multiple interdependent parameters that control scalability and costs.

Scalability

As the number of concurrent requests for a function increases, Lambda will deploy more runtime instances to scale the processing. If the request load continues to grow, Lambda reuses available instances and creates new instances as needed. Eventually, when the request load falls, Lambda scales down by stopping unused instances. That’s the simple version, anyway. In reality, it is a tad more complicated.

All Lambda functions have a built-in concurrency limit for request bursts. Interestingly, this default burst limit varies depending on the AWS region where the function is deployed. For example, in US West (Oregon), a function can scale up to 3,000 instances to handle a burst of requests, whereas in Europe (Frankfurt) the limit is 1,000 instances.5

Regardless of the region, once the burst limit is reached, a function can scale at a rate of 500 instances per minute. This continues until the demand is satisfied and requests start to drop off. If the request load exceeds the capacity that can be processed by 500 additional instances per minute, Lambda throttles the function and returns an HTTP 429 to clients, who must retry the request.

This behavior is depicted in Figure 8-2. During the request burst, the number of instances grows rapidly up to the region-defined burst limit. After that, only 500 new instances can be deployed per minute. During this time, requests that cannot be satisfied by the available instances are throttled. As the request load drops, instances are removed from the platform until a steady state of traffic resumes.

Precisely how many concurrent client requests a function can handle depends on the processing time for the function. For example, assume we have 3,000 deployed instances, and each request takes on average 100 milliseconds to process. This means that each instance can process 10 requests per second, giving a maximum throughput of (3,000 × 10) = 30,000 requests per second.
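
This back-of-the-envelope calculation generalizes easily, assuming uniform request processing times:

instances = 3000            # e.g., the US West (Oregon) burst limit
avg_processing_s = 0.100    # 100 milliseconds per request on average
throughput = instances * (1 / avg_processing_s)
print(throughput)           # 30,000 requests per second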

Figure 8-2. Scaling AWS Lambda functions

To complete the picture, you need to be aware that the burst concurrency limit actually applies to all functions in the region associated with a single AWS account. So, if you deploy three different Lambda functions in the same region under one account, their collective number of deployed instances is controlled by the burst limit that determines the scaling behavior. This means if one function is suddenly and unexpectedly heavily loaded, it can consume the burst limit and negatively impact the availability of other functions that wish to scale at the same time.

To address this potential conflict, you can fine-tune the concurrency levels associated with each individual Lambda function deployed under the same AWS account in the same region.6 This is known as reserved concurrency. Each individual function can be associated with a value that is less than the burst limit.7 This value defines the maximum number of instances of that function that can be executed concurrently.

Reserved concurrency has two implications:

  • The Lambda function with reserved concurrency always has execution capacity available exclusively for its own invocations. It cannot be unexpectedly starved by concurrent invocations of other functions in the region.

  • The reserved capacity caps the maximum number of resident instances for that function. Requests that cannot be processed when the number of instances is at the reserved value fail with an HTTP 429 error.
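
Reserved concurrency is configured per function. The following sketch uses the boto3 put_function_concurrency call; the function name and the limit of 200 are hypothetical, and configured AWS credentials are assumed:

import boto3

lambda_client = boto3.client('lambda')

# Reserve (and cap) 200 concurrent instances for this function. Requests
# arriving when all 200 instances are busy fail with HTTP 429.
lambda_client.put_function_concurrency(
    FunctionName='advisor-chat-handler',
    ReservedConcurrentExecutions=200,
)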

As should be apparent from this discussion, AWS Lambda provides a powerful and flexible serverless environment. With care, the runtime environment can be configured to scale effectively to handle high-volume, bursty request loads. It has become an integral part of the AWS toolbox for many organizations’ internal and customer-facing applications.8

Case Study: Balancing Throughput and Costs

Getting the required performance and scalability at lowest cost from a serverless platform almost always requires tweaking of the runtime parameter settings. When your application is potentially processing many millions of requests per day, even a 10% cost reduction can result in significant monetary savings. Certainly, enough to make your boss and clients happy.

All serverless platforms vary in the parameter settings you can tune. Some are relatively straightforward, such as AWS Lambda, in which choosing the amount of memory for a function is the dominant tuning parameter. The other extreme is perhaps Azure Functions, which has multiple parameter settings and deployment limits that differ based on which of three hosting plans is selected.9

GAE sits between these two, with a handful of parameters that govern autoscaling behavior. I’ll use this as an example of how to approach application tuning.

Choosing Parameter Values

There are three main parameters that govern how GAE autoscales an application, as I explained earlier in this chapter. Table 8-1 lists these parameters along with their possible value ranges.

Table 8-1. GAE autoscaling parameters

Parameter name                   Minimum   Maximum   Default
target_throughput_utilization    0.5       0.95      0.6
target_cpu_utilization           0.5       0.95      0.6
max_concurrent_requests          1         80        10

Given these ranges, the question for a software architect is, simply, how do you choose the parameter values that provide the required performance and scalability at lowest cost? Probably the hardest part is figuring out where to start.

Even with three parameters, there is a large combination of possible settings that, potentially, interact with each other. How do you know that you have parameter settings that are serving both your users and your budgets as close to optimal as possible? There’s some good general advice available, but you are still left with the problem of choosing parameter values for your application.

For just the three parameters listed in Table 8-1, there are approximately 170K different configurations. You can’t test all of them. If you put your engineering hat on, and just consider values in increments of 0.05 for throughput and CPU utilization, and increments of 10 for maximum concurrent requests, you still end up with around 648 possible configurations. That is totally impractical to explore, especially as we really don’t know a priori how sensitive our service behavior is going to be to any parameter value setting. So, what can you do?

One way to approach tuning a system is to undertake a parameter study. Also known as a parametric study, the approach comprises three basic steps:

  • Nominate the parameters for evaluation.

  • Define the parameter ranges and discrete values within those ranges.

  • Analyze and compare the results of each parameter variation.

To illustrate this approach, I’ll lead you through an example based on the three parameters in Table 8-1. The aim is to find the parameter settings that give ideally the highest throughput at the lowest cost. The application under test was a GAE Go service that performs reads and writes to a Google Firestore database. The application logic was straightforward, basically performing three steps:

  • Input parameter validation

  • Database access

  • Formatting and returning results

The ratio of write to read requests was 80% to 20%, thus defining a write-heavy workload. I also used a load tester that generated an uninterrupted stream of requests from 512 concurrent client threads at peak load, with short warm-up and cooldown phases of 128 client threads.

GAE Autoscaling Parameter Study Design

For a well-defined parameter study, you need to:

  • Choose the parameter ranges of interest.

  • Within the defined ranges for each parameter, choose one or two intermediate values.

For the example Go application, with its simple business logic and database access, intuition suggests that the default GAE CPU utilization and concurrent request settings are on the low side. Therefore, I chose these two parameters to vary, with the following values:

  • target_cpu_utilization: {0.6, 0.7, 0.8}

  • max_concurrent_requests: {10, 35, 60, 80}

This defines 12 different application configurations, as shown by the entries in Table 8-2.

Table 8-2. Parameter study selected values

cpu_utilization   max_concurrent_requests
0.6               10   35   60   80
0.7               10   35   60   80
0.8               10   35   60   80
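
A grid like this is trivial to enumerate in the script that drives the load tests for each configuration. A minimal sketch, assuming the value sets chosen above:

from itertools import product

target_cpu_utilization = [0.6, 0.7, 0.8]
max_concurrent_requests = [10, 35, 60, 80]

# The cross product yields the 12 test configurations in Table 8-2.
for cpu, max_requests in product(target_cpu_utilization, max_concurrent_requests):
    print(f'target_cpu_utilization={cpu}, max_concurrent_requests={max_requests}')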

The next step is to run load tests on each of the 12 configurations. This was straightforward and took a few hours over two days. Your load-testing tool will capture various test statistics. In this example, you are most interested in overall average throughput obtained and the cost of executing each test. The latter should be straightforward to obtain from the serverless monitoring tools available.

Now, I’ll move on to the really interesting part—the results.

Results

Table 8-3 shows the mean throughput for each test configuration. The highest throughput of 6,178 requests per second is provided by the {CPU80, max10} configuration. This value is 1.7% higher than that provided by the default settings {CPU60, max10}, and around 9% higher than the lowest throughput of 5,605 requests per second. So the results show a roughly 10% variation from lowest to highest throughput. Same code. Same request load. Different configuration parameters.

Table 8-3. Mean throughput (requests/second) for each test configuration

Throughput   max10   max35   max60   max80
CPU60        6,006   6,067   5,860   5,636
CPU70        6,064   6,121   5,993   5,793
CPU80        6,178   5,988   5,989   5,605

Now I’ll factor in cost. In Table 8-4, I’ve normalized the cost for each test run by the cost of the default GAE configuration {CPU60, max10}. So, for example, the cost of the {CPU70, max10} configuration was 18% higher than the default, and the cost of the {CPU80, max80} configuration was 45% lower than the default.

Table 8-4. Mean cost for each test configuration, normalized to the default configuration cost

Normalized instance time   max10   max35   max60   max80
CPU60                      100%    72%     63%     63%
CPU70                      118%    82%     63%     55%
CPU80                      100%    72%     82%     55%

There are several rather interesting observations we can make from these results:

  • The default settings {CPU60, max10} give neither the highest performance nor lowest cost. This configuration makes Google happy, but maybe not your client.

  • We obtain 3% higher performance with the {CPU80, max10} configuration at the same cost of the default configuration.

  • We obtain marginally (approximately 2%) higher performance with 18% lower costs from the {CPU70, max35} configuration as compared to the default configuration settings.

  • We obtain 96% of the default configuration performance at 55% of the costs with the {CPU70, max80} test configuration. That is a pretty decent cost saving for slightly lower throughput.

Armed with this information, you can choose the configuration settings that best balance your costs and performance needs. With multiple, dependent configuration parameters, you are unlikely to find the “best” setting through intuition and expertise. There are too many intertwined factors at play for that to happen. Parameter studies let you quickly and rigorously explore a range of parameter settings. With two or three parameters and three or four values for each, you can explore the parameter space quickly and cheaply. This enables you to see the effects of the combinations of values and make educated decisions on how to deploy your application.

Summary and Further Reading

Serverless platforms are a powerful tool for building scalable applications. They eliminate many of the deployment complexities associated with managing and updating clusters of explicitly allocated virtual machines. Deployment is as simple as developing the service’s code, and uploading it to the platform along with a configuration file. The serverless platform you are using takes care of the rest.

In theory, anyway.

In practice, of course, there are important dials and knobs that you can use to tune the way the underlying serverless platforms manage your functions. These are all platform-specific, but many relate to performance and scalability, and ultimately the amount of money you pay. The case study in this chapter illustrated this relationship and provided you with an approach you can utilize to find that elusive sweet spot that provides the required performance at lower costs than the default platform parameter settings provide.

Exploiting the benefits of serverless computing requires you to buy into a cloud service provider. There are many to choose from, but all come with the attendant vendor lock-in and downstream pain and suffering if you ever decide to migrate to a new platform.

There are open source serverless platforms such as Apache OpenWhisk that can be deployed to on-premises hardware or cloud-provisioned virtual resources. There are also solutions such as the Serverless Framework that are provider-independent. These make it possible to deploy applications written with the Serverless Framework to a number of mainstream cloud providers, including all the usual suspects. This delivers code portability but does not insulate the system from the complexities of different provider deployment environments. Inevitably, achieving the required performance, scalability, and security on a new platform is not going to be a walk in the park.

A great source of information on serverless computing is Jason Katzer’s Learning Serverless (O’Reilly, 2020). I’d also recommend two extremely interesting articles that discuss the current state of the art and future possibilities for serverless computing. These are:

  • D. Taibi et al., “Serverless Computing: Where Are We Now, and Where Are We Heading?” IEEE Software 38, no. 1 (Jan.–Feb. 2021): 25–31, doi: 10.1109/MS.2020.3028708.

  • J. Schleier-Smith et al., “What Serverless Computing Is and Should Become: The Next Phase of Cloud Computing,” Communications of the ACM 64, no. 5 (May 2021): 76–84.

Finally, serverless platforms are a common technology for implementing microservices architectures. Microservices are an architectural pattern for decomposing an application into multiple independently deployable and scalable parts. This design approach is highly amenable to a serverless-based implementation, and conveniently, is the topic we cover in the next chapter.

1 There’s also an optional min-pending-latency parameter, with a default value of zero. If you are brave, how the minimum and maximum values work together is explained in this documentation.

2 As of 2021, Lambda also supports services that are built using Docker containers. This gives the developer the scope to choose language runtime when creating the container image.

3 This experiment describes how long idle functions are kept resident.

4 Per the AWS Lambda documentation, “At 1,769 MB, a function has the equivalent of one vCPU (one vCPU-second of credits per second).”

5 Established customers can negotiate with AWS to increase these limits.

6 Alternatively, if the Lambda usage is across different applications, it could be separated into different accounts. AWS account design and usage is, however, outside the scope of this book.

7 Actually, the maximum reserved concurrency for a function is (burst limit – 100). AWS reserves 100 concurrent instances for all functions that are not associated with explicit concurrency limits. This ensures that all functions have access to some spare capacity to execute.

8 See https://oreil.ly/nVnNe for an interesting set of curated case studies from Lambda users.

9 Scaling Azure functions is covered in the documentation.

Chapter 9. Microservices

You don’t often see strong links between a mainstream software architectural style and an Italian-inspired, globally popular cuisine. This is, however, the case with microservices and pizza. The roots of microservices can be traced back to around 2008 when the approach was pioneered at scale by the internet giants we all know. At Amazon, the “two-pizza rule” emerged as a governing principle of team size for a single system component, which subsequently became known as a microservice. What is the two-pizza rule? Very simply, every internal team should be small enough that it can be fed with two pizzas.

It is a misconception, however, that microservices are in some sense smaller than a service. The defining characteristic of a microservice is their scope, organized around a business capability. Put very simply, microservices are an approach to designing and deploying fine-grained, highly cohesive, and loosely coupled services that are composed to fulfill the system’s requirements. These fine-grained services, or microservices, are independently deployed and must communicate and coordinate when necessary to handle individual system requests. Hence, by their very nature, microservices architectures are distributed systems, and must deal with the various scalability, performance, and availability issues I have described in previous chapters.

Microservices are a popular, modern architectural style with plenty of engineering advantages in the right context. For example, small, agile teams with single microservice responsibilities can iterate and evolve features quickly, and deploy updated versions independently. Each microservice is a black box to the rest of the system and can choose an architecture and technology stack internally that best suits the team’s and application’s needs. Major new system functionalities can be built as microservices and composed into the application architecture with minimal impact on the rest of the system.

In this chapter, I’ll briefly describe microservices and explain their key characteristics. I’ll touch on the major engineering and architectural principles behind a microservices approach and provide pointers to excellent sources of general design knowledge. The main focus of the chapter, given the topic of this book, is the inherently distributed nature of microservices and how they behave at scale. I will describe some problems that emerge as coupled microservices are placed under load and solutions that you need to design into your architecture to build scalable, resilient applications.

The Movement to Microservices

In many ways, microservice-based architectures have benefited from a confluence of software engineering and technology innovation that has emerged over the last decade. Small, agile teams, continuous development and integration practices, and deployment technologies have collectively provided fertile ground for the fine-grained architectural approach embodied by microservices. Microservice-based architectures are a catalyst for exploiting these advances to deploy flexible, extensible, and scalable systems. Let’s examine their origins and some features.

Monolithic Applications

Since the dawn of IT systems, the monolithic architectural style has dominated enterprise applications. Essentially, this style decomposes an application into multiple logical modules or services, which are built and deployed as a single application. These services offer endpoints that can be called by external clients. Endpoints provide security and input validation and then delegate the requests to shared business logic, which in turn will access a persistent store through a data access objects (DAO) layer. This design is depicted in Figure 9-1 for an example university management system that has capabilities to handle student course assignments and timetables, room scheduling, fee payments, and faculty and advisor interactions.

This architecture encourages the creation of reusable business logic and DAOs that can be shared across service implementations. DAOs are mapped to database entities, and all service implementations share a single database.

Popular platforms such as IBM WebSphere and Microsoft .NET enable all the services to be built and deployed as a single executable package. This is where the term monolith—the complete application—originates. APIs, business logic, data access, and so forth are all wrapped up in a single deployment artifact.

Figure 9-1. An example monolithic application

Monolithic applications, unsurprisingly given the longevity of the approach, have many advantages. The architectural approach is well understood and provides a solid foundation for new applications. It enjoys extensive automation in development frameworks in many languages. Testing is straightforward, as is deployment as there is just a single application package to manage. System and error monitoring is also simplified as the application runs on one (probably quite powerful) server.

Scaling up is the simplest way to improve responsiveness and capacity for monolithic applications. Scaling out is also possible. Two or more copies of the monolith can be provisioned, and a load balancer utilized to distribute requests. This works for both stateful and stateless services, as long as the load balancer supports session affinity for stateful designs.

Monoliths can start to become problematic as system features and request volumes grow. This problem has two fundamental elements:

Code base complexity
As the size of the application and engineering team grows, adding new features, testing, and refactoring become progressively more difficult. Technical debt inevitably builds, and without significant investments in engineering, the code becomes more and more fragile. Engineering becomes harder without continual and concerted refactoring efforts to maintain architectural integrity and code quality. Development cadence increases for rolling out new features.
Scaling out
You can scale out by replicating the application on multiple nodes to add capacity. But this means replicating the entire application (the monolith) every time. In the university management system, assume a sudden spike in the use of the AdvisorChat service occurs as support for mobile devices is released to the students. You can deploy new replicas to handle the chat message volume, but the new nodes need to be powerful and numerous enough to run the complete application. You can’t easily just pull out the chat service functionality and scale it independently.

This is where microservices enter the scene. They provide solutions to the engineering and scale-out challenges that monoliths almost inevitably face as the volume of requests grows rapidly.

Breaking Up the Monolith

A microservice architecture decomposes the application functionality into multiple independent services that communicate and coordinate when necessary. Figure 9-2 shows how the university management system from Figure 9-1 might be designed using microservices. Each microservice is totally self-contained, encapsulating its own data storage where needed, and offers an API for communications.

Figure 9-2. An example microservices architecture

Microservices offer the following advantages as systems grow in code size and request load:

Code base
Following the two-pizza rule, an individual service should not be more complex than a small team size can build, evolve, and manage. As a microservice is a black box, the team has full autonomy to choose their own development stack and data management platform.1 Given the narrower, highly cohesive scope of functionality that a well-designed microservice supports, this should result in lower code complexity and higher development cadence for new features. In addition, revisions of the microservice can be independently deployed as needed. If the API the microservice supports is stable, the change is transparent to dependent services.
Scale out
Individual microservices can be scaled out to meet request volume and latency requirements. For example, to satisfy the ever-demanding and chatting students, the AdvisorChat microservice can be replicated as needed behind its own load balancer to provide low response times. This is depicted in Figure 9-3. Other services that experience light loads can simply run on a single node or be replicated at low cost to eliminate single points of failure and enhance availability.
Figure 9-3. Scaling microservices independently

One of the key design decisions when moving to a microservices architecture is how to decompose the system functionality into individual services. Domain-driven design (DDD) provides a suitable method for identifying microservices, as the necessarily self-contained nature of microservices maps well to the notion of bounded contexts in DDD. These topics are beyond the scope of this chapter but are essential knowledge for architects of microservice-based applications.

There is always a balancing act though. Microservices are by their very nature distributed. Often, the purity of the domain model needs to be analyzed and adjusted to meet the reality of the costs of distributed communications and the complexity of system management and monitoring. You need to factor in request loads and the interactions needed to serve these requests, so that excessive latencies aren’t incurred by multiple interactions between microservices.

For example, Faculty and Funding are excellent candidates for microservices. However, if satisfying requests such as “get funding by faculty” or “find funding opportunities for faculty” incur excessive communications, performance and reliability could be impacted. Merging microservices may be a sensible option in such circumstances. Another common approach is to duplicate data across coupled microservices. This enables a service to access the data it needs locally, simplifying the design and reducing data access response times.

Duplicate data is, of course, a trade-off. It takes additional storage capacity and development effort to ensure all duplicated data converges to a consistent state. Duplicate data updates can be initiated immediately when data changes to attempt to minimize the time interval that the duplicates are inconsistent. Alternatively, if the business context allows, periodic duplication (e.g., hourly or daily) can operate, perhaps executed by a scheduled task that is invoked when request loads are low. As the demands on performance and scalability on an application grow, the cost and complexity of duplicate data is typically small compared to the problems that a major refactoring of the system would present.

Deploying Microservices

To support frequent updates and benefit from the agility afforded by small teams, you need to be able to deploy new microservice versions easily and quickly. This is where we start to infringe on the world of continuous deployment and DevOps, which is way beyond the scope of this book (see “Summary and Further Reading” for reading recommendations). Still, deployment options impinge on the ease of scalability for a microservice. I’ll just describe one common approach for deploying microservices in this section.

Serverless processing platforms, as I described in Chapter 8, are an attractive microservices deployment approach. A microservice can be built to expose its API on the serverless platform of your choice. The serverless option has three advantages:

Deployment is simple
Just upload the new executable package for your microservice to the endpoint you have configured for your function.
Pay by usage
If your service has periods of low-volume requests, your costs are low, even zero.
Ease of scaling
The platform you choose handles scaling of your function. You control precisely how this works through configuration parameters, but the serverless option takes the heavy lifting out of scalability.

When you deploy all your microservices on a serverless platform, you expose multiple endpoints that clients need to invoke. This introduces complexity as clients need to be able to discover the location (host IP address and port) of each microservice. What if you decide to refactor your microservices by perhaps combining two in order to eliminate network calls? Or move an API implementation from one microservice to another? Or even change the endpoint (IP address and port) of an API?

Exposing backend changes directly to clients is never a good idea. The Gang of Four book taught us this many years ago with the façade pattern in object-oriented systems.2 In microservices, you can exploit an analogous approach using the API gateway pattern. An API gateway essentially acts as a single entry point for all client requests, as shown in Figure 9-4. It insulates clients from the underlying architecture of the microservices that implement the application functionality. Now, if you refactor your underlying APIs or even choose to deploy on a radically different platform such as a private cloud, clients are oblivious to changes.

Figure 9-4. The API gateway pattern

There are multiple API gateway implementations you can exploit in your systems. These range from powerful open source solutions such as the NGINX Plus and Kong API gateways to cloud vendor–specific managed offerings. The general range of functions, listed as follows, is similar:

  • Proxy incoming client API requests with low millisecond latencies to backend microservices that implement the API. Mapping between client-facing APIs, handled by the API gateway, and backend microservice APIs is performed through admin tools or configuration files. Capabilities and performance vary across products, sometimes quite significantly, especially under high request loads.3

  • Provide authentication and authorization for requests.

  • Define rules for throttling each API. Setting the maximum number of requests a microservice can handle per second can be used to ensure backend processing is not overwhelmed.

  • Support a cache for API results so that requests can be handled without invoking backend services.

  • Integrate with monitoring tools to support analysis of API usage, latencies, and error metrics.

Under heavy request spikes, there is, of course, the danger of the API gateway becoming a bottleneck. How this is handled by your API gateway is product-specific. For example, AWS API Gateway has a 10K requests per second limit, with an additional burst quota of up to 5K requests/second.4 The Kong API gateway is stateless, hence it is possible to deploy multiple instances and distribute the requests using a load balancer.

Principles of Microservices

There’s considerably more to the art and science of designing, deploying, and evolving microservices-based architectures. I’ve just scratched the surface in the discussions so far in this chapter. Before I move on to address some of the scalability and availability challenges of microservices that must be addressed due to their distributed nature, it’s worth briefly thinking about the core principles of microservices as defined by Sam Newman in his excellent Building Microservices book (O’Reilly, 2015). I’ve listed them here with some additional commentary alluding to performance and scalability aspects.

Microservices should be:

Modeled around a business domain
The notion of bounded contexts provides a starting point for the scope of a microservice. Business domain boundaries may need rethinking in the context of coupling between microservices and the performance overheads it may introduce.
Highly observable
Monitoring of each service is essential to ensure it is behaving as expected, processing requests with low latencies, and logging error conditions. In distributed systems, observability is an essential characteristic for effective operations.
Hide implementation details
Microservices are black boxes. Their API is a contract which they are guaranteed to support, but how this is carried out is not exposed externally. This gives freedom for each team to choose development stacks that can be optimized to the requirements of the microservice.
Decentralize all the things
One thing to decentralize is the processing of client requests that require multiple calls to downstream microservices. These are often called workflows. There are two basic approaches to achieving this, namely orchestration and choreography. “Workflows” describes these topics.
Isolate failure
The failure of one microservice should not propagate to others and bring down the application. The system should continue to operate, although probably with some degraded service quality. Much of the rest of this chapter addresses this principle specifically.
Deploy independently
Every microservice should be independently deployable, to enable teams to roll out enhancements and modifications without any dependency on the progress of other teams.
Culture of automation
Development and DevOps tooling and practices are absolutely essential to gain the benefits of microservices. Automation makes it faster and more robust to make changes to the deployed system frequently. This frequency may be, for example, hourly or daily, depending on the system and the pace of development.

Resilience in Microservices

One of the frequently unstated truisms of distributed systems is that, for the vast majority of the time, systems operate without catastrophic errors. Networks are fast and reliable, machines and disks rarely crash, and the foundational platforms you use for hosting microservices, messaging, and databases are incredibly robust. This is especially true when systems are handling low request volumes and have plenty of CPU, memory, and network bandwidth to keep their users extremely happy. Of course, your system still has to be prepared for intermittent failures that will occur, usually at the most inconvenient of times!

Things start to get really fun when request frequencies and volumes increase. Threads contend for processing time, memory becomes scarce, network connections become saturated, and latencies increase. This is when individual microservices start behaving unpredictably. Then, all bets are off.

To ensure your systems don’t fail suddenly as loads increase, there are a number of necessary precautions you need to take. I’ll explain the nature of the problems that you need to be aware of, and the solutions available, in the following subsections.

Cascading Failures

Figure 9-5 depicts a simple microservices architecture. A request arrives at microservice A. To process this request, it calls microservice B, which in turn calls microservice C. Once microservice C responds, B can return the results to A, which in turn can respond to the client. The numbers in the figure represent this sequence for an individual request.

Figure 9-5. Microservices with dependencies

Now I’ll assume that the request load on microservice A grows. This means A will exert more load on B, which will in turn exert more load on C. For some reason, such as lack of processing capacity or database contention, this causes the response times from microservice C to increase, which creates back pressure on B and causes it to respond more slowly to A.

If the increased load is sustained for a period of time, threads in the microservices A and B are blocked waiting for requests to be handled by downstream processing. Let’s assume microservice C becomes overloaded—perhaps the request pattern causes database deadlocks on frequently updated keys, or the network connection to C’s database becomes unstable. In an overloaded state, response times increase and B’s threads become blocked waiting for results. Remember from Chapter 2, application servers have fixed-size thread pools. Once all threads in B are occupied making calls to C, if requests continue to arrive at high volumes, they will be queued until a thread is available. Response times from B to A start to grow, and in an instant all of A’s threads will be blocked waiting for B to respond.

At this stage, things will likely start to break. TCP requests will time out and throw an error to the caller. New connections will be refused as the dependent service is overloaded. Microservices may fail if memory is exhausted, or the increased load uncovers subtle bugs that weren’t revealed during testing. These errors ripple, or cascade, back through the call chain. In the example in Figure 9-5, the slow responses from C can cause requests to A and B to fail.

The insidious nature of cascading failures is that they are triggered by slow response times of dependent services. If a downstream service simply fails or is unavailable due to a system crash or transient network failure, the caller gets an error immediately and can respond accordingly. This is not the case with services that gradually slow down. Requests return results, just with longer response times. If the overwhelmed component continues to be bombarded with requests, it has no time to recover and response times continue to grow.

This situation is often exacerbated by clients that, upon request failure, immediately retry the operation, as illustrated in the following code snippet:

int retries = RETRY_COUNT;
while (retries > 0) {
    try {
        callDependentService();
        return true;
    } catch (RemoteCallException ex) {
        // log the failure and loop to immediately retry the call
        logError(ex);
        retries = retries - 1;
    }
}
return false;

Immediate retries simply maintain the load on the overwhelmed microservice, with very predictable results, namely another exception. Overload situations don’t disappear in a few milliseconds. In fact, they are likely to persist for many seconds or even minutes. Retries just keep the pressure on.

The retry example can be improved by techniques such as exponential backoff, namely inserting a growing delay between retries. This potentially can help relieve the downstream overload, but the delay becomes part of the latency experienced by the caller, which often doesn’t help matters.
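
As a minimal sketch, here is the retry loop above reworked with exponential backoff. It reuses the same hypothetical helpers from the earlier example (callDependentService, logError, RemoteCallException) and doubles the delay after each failed attempt:

int retries = RETRY_COUNT;
long delayMs = 100; // initial backoff; grows to 200, 400, 800, ... ms
while (retries > 0) {
    try {
        callDependentService();
        return true;
    } catch (RemoteCallException ex) {
        logError(ex);
        retries = retries - 1;
        try {
            Thread.sleep(delayMs); // back off before the next attempt
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
        delayMs = delayMs * 2; // exponential growth of the backoff delay
    }
}
return false;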

Cascading failures are common in distributed systems. Whether caused by overwhelmed services, or error conditions such as bugs or network problems, there are explicit steps you need to take to guard against them.

Fail fast pattern

The core problem with slow services is that they utilize system resources for requests for extended periods. A requesting thread is stalled until it receives a response. For example, let’s assume we have an API that normally responds within 50 ms. This means each thread can process around 20 requests per second. If one request is stalled for 3 seconds due to an outlier response time, then that’s (3 × 20) – 1 = 59 requests that could have been processed.

Even with the best designed APIs for a microservice, there will be outlier responses. Real workloads exhibit a long-tail response time profile, as illustrated in Figure 9-6. A small number of requests takes significantly longer—sometimes 20 or 100 times more—than the average response time. This can be for a number of reasons. Garbage collection in the server, database contention, excessive context switching, system page faults, and dropped network requests are all common causes of this long tail.

Figure 9-6. Typical long-tail response times

As you can observe from this graph, the vast majority of requests have low response times, which is great. However, a significant number are over one second and a small number much, much slower—over 4 seconds, in fact.

We can quantify the percentage of slow requests using percentiles. Percentiles give a far richer and more accurate view of response times from a microservice than averages. For example, if we measure response times and calculate percentiles under expected loads, we may get the following:

  • P50: 200 milliseconds

  • P95: 1,200 milliseconds

  • P99: 3,000 milliseconds

This means that 50% of requests are served in less than 200 milliseconds, 95% are served within 1,200 milliseconds, and 99% within 3,000 milliseconds. These numbers in general look pretty good. But let’s assume our API handles 200 million requests per day (approximately 2,314 requests per second). This means 1%, or 2 million requests, take greater than 3 seconds, which is 15 times slower than the 50th percentile (the median). And some requests will be significantly longer than 3 seconds given the long-tail response time pattern we see in Figure 9-6.

Long response times are never good things, technically or for client engagement. In fact, many studies have shown how longer response times have negative effects on system usage. For example, the BBC reported that it sees 10% fewer users for every additional second a page takes to load. Fast, stable response times are great for business, and one way to achieve this is to reduce the long tail. This also has the effect of decreasing the overall average response time for a service, as the average is skewed heavily by a small number of slow responses.

A common way to eliminate long response times is to fail fast. There are two main ways to achieve this:

  • When a request takes longer than some predefined time limit, instead of waiting for it to complete, the client returns an error to its caller. This releases the thread and other resources associated with the request.

  • Enable throttling on a server. If the request load exceeds some threshold, immediately fail the request with an HTTP 503 error. This indicates to the client that the service is unavailable.

Exactly how these strategies are put into action is extremely technology-specific. For example, a client making an HTTP request can configure the TCP read timeout. This specifies how long a client should wait to receive a response from the server. In our example in Figure 9-6, we could configure the read timeout to the P99 value, namely 3 seconds or a little higher. Then, if a client hasn’t received any response within the read timeout period, an exception is raised. In Java, it’s a java.net.SocketTimeoutException.
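
As a minimal sketch (the endpoint URL is an illustrative assumption), here is how a Java client can fail fast by bounding connection and read times with java.net.HttpURLConnection:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

public class FailFastClient {
    // returns the HTTP status code, or -1 if the P99-based time budget is exceeded
    static int callWithTimeout(String endpoint) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(1000); // fail fast on connection establishment
        conn.setReadTimeout(3000);    // the P99 value from Figure 9-6
        try {
            return conn.getResponseCode();
        } catch (SocketTimeoutException ex) {
            return -1; // fail fast: report the error to our caller immediately
        }
    }
}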

Throttling, or rate limiting, is a feature available in many load balancers and API gateway technologies. When some defined limits are reached, the load balancer will simply reject requests, protecting the resources it controls from overload. This enables the service to process requests with consistent low response times. It’s also possible to implement some lightweight monitoring logic inside your microservice to implement throttling. You might keep a count of in-flight requests, and if the count exceeds a defined maximum, new requests are rejected. A slightly more sophisticated approach could track a metric like the average response time, or P99s, using a sliding window algorithm. If the metric of interest is increasing, or exceeds some defined threshold, again requests can be immediately rejected.
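
The in-flight request counting approach can be sketched in a few lines of Java. This is illustrative only; the class name and limit are assumptions, not any particular product’s implementation:

import java.util.concurrent.atomic.AtomicInteger;

public class InFlightThrottle {
    private static final int MAX_IN_FLIGHT = 500; // assumed threshold
    private final AtomicInteger inFlight = new AtomicInteger();

    // returns true if the request may proceed; false means reject it
    // immediately, for example with an HTTP 503 response
    public boolean tryAcquire() {
        if (inFlight.incrementAndGet() > MAX_IN_FLIGHT) {
            inFlight.decrementAndGet();
            return false;
        }
        return true;
    }

    // call when request processing completes, successfully or not
    public void release() {
        inFlight.decrementAndGet();
    }
}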

There’s one more thing to consider when failing requests. A principle of microservices is fault isolation. This means the failure of part of the system doesn’t make the whole application unavailable. Requests can continue to be processed, but with some degraded capabilities.

A key thing to consider is whether it is necessary to propagate the error back to the original caller. Or can some canned, default response be sent that masks the fact that the request was not correctly processed? For example, when you sign into a streaming video service, the first page will show your watchlist so you can return to your favorite shows as quickly as possible. If, however, the request to retrieve your watchlist fails, or takes too long, a default collection of popular “shows you might like” can be returned. The application is still available.
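
A hedged sketch of this default-response approach follows; every name here (WatchlistService, Show, popularShows) is invented purely for illustration, and RemoteCallException is the hypothetical exception from the earlier retry example:

import java.util.List;

public class WatchlistHandler {
    private final WatchlistService watchlistService; // hypothetical remote client
    private final List<Show> popularShows;           // precomputed default list

    public WatchlistHandler(WatchlistService svc, List<Show> defaults) {
        this.watchlistService = svc;
        this.popularShows = defaults;
    }

    // mask a failed or slow watchlist call with a canned default response
    public List<Show> getWatchlist(String userId) {
        try {
            return watchlistService.fetch(userId); // may time out and throw
        } catch (RemoteCallException ex) {
            return popularShows; // the "shows you might like" fallback
        }
    }
}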

This approach works really well for transient, ephemeral failures. By the time the request is issued again by the users, the problem will probably be resolved. And there’s a good chance the user won’t even have noticed. Some transient errors, however, don’t resolve in a second or two. That’s when you need a more robust approach.

Circuit breaker pattern

If a microservice starts to throw errors due to an overload situation, or a flaky network, it makes little sense to keep trying to send requests to the API. Rather than failing fast, which still incurs a timeout delay, it is better to back off immediately from sending further requests and allow some time for the error situation to resolve. This can be achieved using the circuit breaker pattern, which protects remote endpoints from being overwhelmed when some error conditions occur.

Just like in electrical systems, clients can use a circuit breaker to protect a server from overload. The circuit breaker is configured to monitor some condition, such as error response rates from an endpoint, or the number of requests sent per second. If the configured threshold is reached—for example, 25% of requests are throwing errors—the circuit breaker is triggered. This moves the circuit breaker into an OPEN state, in which all calls return with an error immediately, and no attempt is made to call the unstable or unavailable endpoint.

The circuit breaker then rejects all calls until some suitably configured timeout period expires. At that stage, the circuit breaker moves to the HALF_OPEN state. Now, the circuit breaker allows client calls to be issued to the protected endpoint. If the requests still fail, the timeout period is reset and the circuit breaker stays open. However, if the request succeeds, the circuit breaker transitions to the CLOSED state and requests start to flow to the target endpoint. This scheme is illustrated in Figure 9-7.

Figure 9-7. The circuit breaker pattern

Circuit breakers are essential to reduce the resources utilized for operations that are almost certain to fail. The client fails fast, and the OPEN circuit breaker relieves load on an overwhelmed server by ensuring requests do not reach it. For overloaded services, this creates an opportunity to stabilize. When the service (hopefully) recovers, the circuit breaker resets automatically and normal operations resume.

There are numerous libraries available for incorporating circuit breakers into your applications. One popular library for Python, CircuitBreaker, is illustrated in the following code example. You simply decorate the external call you want to protect with @circuit, and specify the value of the parameters you wish to set to customize the circuit breaker behavior. In this example, we trigger the circuit breaker after 20 successive failures are detected, and the circuit breaker stays open for 5 seconds until it transitions to the half open state:

from circuitbreaker import circuit
from requests.exceptions import RequestException  # failure type that trips the breaker

@circuit(failure_threshold=20, expected_exception=RequestException,
         recovery_timeout=5)
def api_call():
    ...  # make the protected remote call here (details omitted)

Circuit breakers are highly effective for fault isolation. They protect clients from faulty operations of dependent services and allow services to recover. In read-heavy scenarios, requests can often return default or cached results when the circuit breaker is open. This effectively hides the fault from clients and doesn’t degrade service throughput and response times. Ensure you tie circuit breaker triggers into your monitoring and logging infrastructure so that the cause of faults can be diagnosed.

Bulkhead Pattern

The term bulkhead is inspired by large shipbuilding practices. Internally the ship is divided into several physical partitions, ensuring if a leak occurs in one part of the boat’s hull, only a single partition is flooded and the boat, rather importantly, continues to float. Basically, bulkheads are a damage limitation strategy.

Imagine a microservice with two endpoints. One enables clients to request the status of their current orders placed through the service. The other enables clients to create new orders for products. In normal operations, the majority of requests are status requests, entailing a fast cache or database read. Occasionally, when a popular new product is released, a flood of new order requests can arrive simultaneously. These are much more heavyweight, requiring database inserts and writes to queues.

Requests for these two endpoints share a common thread pool in the application server platform on which the microservice is deployed. When a new order surge arrives, all threads in the thread pool become occupied by new order creations, and status requests are essentially starved of resources. This leads to unacceptable response times, and potentially to client calls seeing exceptions if a fail fast approach is used.

Bulkheads help us solve this problem. We can reserve a number of threads in a microservice to handle specific requests. In our example, we could specify that the new order request has a maximum of 150 threads of the shared thread pool available for its exclusive use. This ensures that when a new order request burst occurs, we can still handle status requests with acceptable response times because there is additional capacity in the thread pool.

The Java Resilience4j library provides an implementation of the bulkhead pattern using the functional programming features of Java 8 onward. The bulkhead pattern segregates remote resource calls in their own thread pools so that a single overloaded or failing service does not consume all threads available in the application server.

The following example code shows how to create a bulkhead that allows a maximum of 150 concurrent requests. If 150 threads are in use for the service that you wish to restrict with the bulkhead, requests will wait a maximum of 1 second before the default BulkheadFullException exception is thrown:

import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.bulkhead.BulkheadRegistry;
import java.time.Duration;

// configure the bulkhead
BulkheadConfig config = BulkheadConfig.custom()
            .maxConcurrentCalls(150)
            .maxWaitDuration(Duration.ofSeconds(1))
            .build();
BulkheadRegistry registry = BulkheadRegistry.of(config);
// create the bulkhead
Bulkhead newOrderBulkhead = registry.bulkhead("newOrder");

Next, you specify that the OrderService.newOrder() method should be decorated with the bulkhead. This ensures that a maximum of 150 invocations of this method can occur concurrently:

// decorate the OrderService.newOrder method with the bulkhead
Supplier<OrderOutcome> orderSupplier = () ->
    OrderService.newOrder(orderInfo);
// wrap the supplier with the bulkhead created above
Supplier<OrderOutcome> bulkheadOrderSupplier =
    Bulkhead.decorateSupplier(newOrderBulkhead, orderSupplier);
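
To execute the guarded call, you invoke the decorated supplier. As a minimal sketch (assuming the declarations above), a BulkheadFullException is thrown if the bulkhead is saturated and the 1-second maxWaitDuration elapses:

import io.github.resilience4j.bulkhead.BulkheadFullException;

try {
    OrderOutcome outcome = bulkheadOrderSupplier.get();
    // process the successful outcome ...
} catch (BulkheadFullException ex) {
    // bulkhead capacity reached: reject the order or return a "busy" response
}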

Spring Boot simplifies the creation of a bulkhead using its dependency injection capabilities. You can specify the configuration of the bulkhead in the application.yml file, shown as follows:

server:
  tomcat:
    threads:
      max: 200
resilience4j.bulkhead:
  instances:
    OrderService:
      maxConcurrentCalls: 150
      maxWaitDuration: 1000ms

In the code, you simply use the @Bulkhead annotation to specify the method that should be subject to the bulkhead behavior. In the following example, a fallback method is also specified. This will be invoked when the bulkhead capacity is reached and requests wait for more than 1 second:

@Bulkhead(name = "OrderService", fallbackMethod = "newOrderBusy")
public OrderOutcome newOrder(OrderInfo inf) { /* details omitted */ }

Summary and Further Reading

Embracing microservices requires you to adopt new design and development practices to create a collection of fine-grained, cohesive components to satisfy your application requirements. In addition, you also need to confront the new opportunities and complexities of distributed systems. If you adopt microservices, you simply have no choice.

This chapter has given a brief overview of the motivations for microservices and the advantages they can afford. In the context of this book, the ability to independently scale individual microservices to match increasing demand is often invaluable.

Microservices are frequently coupled, needing to communicate to satisfy a single request. This makes them susceptible to cascading failures. These occur when a microservice starts to return requests with increasing response times—caused, for example, by an overload in requests or transient network errors. Slow response times cause back pressure in the calling services, and eventually a failure in one can cause all dependent services to crash.

Patterns for avoiding cascading failures include failing fast using timeouts and circuit breakers. These essentially give the stressed microservice time to recover and stop cascading failures from occurring. The bulkhead pattern is similar in intent. It can be used to ensure requests to one API in a microservice don’t utilize all available resources during a request burst. By setting a maximum limit on the number of threads in the application server a particular API can demand, processing capacity for other APIs can be guaranteed.

Microservices are a major topic in software architecture. For a complete and comprehensive coverage of the topic, there is no better source than Sam Newman’s Building Microservices, 2nd Edition (O’Reilly, 2021). This will take you on an in-depth journey following the design, development, and deployment of microservices-based systems.

Microservices require extensive automation of the development process. The 2011 classic Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation by Jez Humble and David Farley (Addison-Wesley Professional) is an ideal place to start for a comprehensive introduction to the topic. Another excellent source of information is DevOps: A Software Architect’s Perspective by Len Bass, Ingo Weber, and Liming Zhu (Addison-Wesley Professional, 2015). The world of DevOps is a fast-moving and technologically rich domain, and your favorite search engine is the best place to find information on the various build, configuration, test, deployment, and monitoring platforms that comprise modern DevOps pipelines.

Next, Part III of this book focuses on the topic of the storage layer. I’ll be describing the core principles and algorithms that determine how we can distribute data stores to achieve scalability, availability, and consistency. The theory will be complemented by examining how a number of widely used databases operate in distributed systems, and the various approaches and architectural trade-offs they take.

1 Standardization of the development stack across microservices does, of course, have advantages, as Susan Fowler explains in Production-Ready Microservices (O’Reilly, 2016).

2 Erich Gamma et al., Design Patterns: Elements of Reusable Object-Oriented Software (Addison Wesley Professional, 1994).

3 An excellent NGINX study benchmarks the performance of API gateways. It is performed by one of the vendors, so a hint of caution in interpreting results is required. Studies like this are valuable in assessing potential solutions.

4 This limit can be increased.

Part III. Scalable Distributed Databases

Part III takes us into the complex realm of scaling the data tier. This is where distributed systems theory is most prominent. As systems introduce data replicas to facilitate scalability, other system qualities such as availability and especially consistency must be addressed—these qualities are indelibly entwined in distributed data systems. I’ll motivate the need for the algorithms that make distributed databases function and sketch out some of the algorithms that are utilized. I’ll then illustrate how these algorithms are manifested in major distributed databases including MongoDB, Google Cloud Spanner, and Amazon DynamoDB.

Chapter 10. Scalable Database Fundamentals

In the early 2000s, the world of databases was a comparatively calm and straightforward place. There were a few exceptions, but the vast majority of applications were built on relational database technologies. Systems leveraged one of a handful of relational databases from the major vendors, and these still dominate the top ten spots in database market share ranking today.

If you could jump into a time machine and look at a similar ranking from 2001, you’d probably find 7 of the current top 10—all relational databases—in similar places to the ones they occupy in 2022. But if you examine the top 20 in 2022, at least 10 of the current database engines listed did not exist 20 years ago, and most of these are not relational. The market has expanded and diversified.

This chapter is the first of four in Part III that focuses on the data—or persistent storage—tier. I’ll cover the ever-changing and evolving scalable database landscape, including distributed nonrelational and relational approaches, and the fundamental approaches that underpin these technologies.

In this chapter, I’ll explain how traditional relational databases have evolved to adopt distributed architectures to address scalability. I’ll then introduce some of the main characteristics of the new generation of databases that have emerged to natively support distribution. Finally, I’ll describe the architectures utilized for distributing data across multiple database nodes and the trade-offs inherent with these approaches regardless of the data models they support.

Distributed Databases

The data systems we build today dwarf those of 20 years ago, when relational databases ruled the earth. This growth in data set size and complexity has been driven by internet-scale applications. These create and manage vast quantities of heterogeneous data for literally tens of millions of users. This includes, for example, user profiles, user preferences, behavioral data, images and videos, sales data, advertising, sensor readings, monitoring data, and much more. Many data sets are simply far too big to fit on a single machine.

This has necessitated the evolution of database engines to manage massive collections of distributed data. New generations of relational and nonrelational database platforms have emerged, with a wide range of competing capabilities aimed at satisfying different use cases and scalability requirements. Simultaneously, the development of low-cost, powerful hardware has made it possible to cost-effectively distribute data across literally hundreds or even thousands of nodes and disks. This enhances both scalability and, by replicating data, availability.

Another major driver of database engine innovation has been the changing nature of the application requirements that populate the internet today. The inherent strengths of relational databases, namely transactions and consistency, come at a performance cost that is not always justified in sites like Twitter and Facebook. These don’t have requirements for every user to always see the same version of, for example, my tweets or timeline updates. Who cares if the latest photo of my delicious dinner is seen immediately by some of my followers and friends, while others have to wait a few seconds to admire the artful dish I’m consuming?

With tens of thousands to millions of users, it is possible to relax the various data constraints that relational databases support and attain enhanced performance and scalability. This enables the creation of new, nonrelational data models and natively distributed database engines, designed to support the variety of use cases for today’s applications. There are trade-offs, of course. These manifest themselves in the range of features a database supports and the complexity of its programming model.

Scaling Relational Databases

Databases that support the relational model and SQL query language represent some of the most mature, stable, and powerful software platforms that exist today. You’ll find relational databases lurking behind systems in every type of application domain you can imagine. They are incredibly complex and amazingly successful technologies.

Relational database technology was designed and matured when data sets were relatively small by today’s standards, and the database could run on a single machine. As data sets have grown, approaches to scale databases have emerged. I’ll briefly cover these with some examples in the following subsections.

Scaling Up

Relational databases were designed to run on a single machine, which enables shared memory and disks to be exploited to store data and process queries. This makes it possible for database engines to be customized to run on machines with multiple CPUs, disks, and large shared memories. Database engines can exploit these resources to execute many thousands of queries in parallel to provide extremely high throughput.

Figure 10-1 depicts the scale-up scenario. The database is migrated to new, more powerful (virtual) hardware. While there is database administration magic to perform the migration and tune the database configuration to effectively exploit the new resources, the application code should require no changes.

There are three main downsides to this approach:

Cost
Hardware costs tend to grow exponentially as the computational resources offered grow.
Availability
You still have a single database node, albeit a powerful one. If it becomes unavailable, your system is down. A multitude of high availability (HA) solutions exist that offer mechanisms to detect unavailability and failover to a backup copy of the database. Many HA solutions are database vendor dependent.
Growth
If your database continues to grow, another migration to more powerful hardware is inevitable.
Figure 10-1. Example relational database scale-up scenario

Scaling up is indeed attractive in many applications. Still, in high-volume applications, there are two common scenarios in which scaling up becomes problematic. First, the database grows to exceed the processing capability of a single node. Second, low latency database accesses are required to service clients spread around the globe. Traversing intercontinental networks just doesn’t cut it.

In both cases, distributing a database is necessary.

Scaling Out: Read Replicas

A common first step to increasing a database’s processing capacity is to scale out using read replicas. You configure one or more nodes as read replicas of the main database. The main database node is known as the primary, and read replicas are known as secondaries. The secondaries maintain a copy of the main database. Writes are only possible to the primary, and all changes are then asynchronously replicated to secondaries. Secondaries may be physically located in different data centers or different continents to support global clients.

This architecture is shown in Figure 10-2.

Figure 10-2. Distribution using read replication

This approach enhances scalability by directing all reads to the read replicas.1 It is hence highly effective for applications that must support read-heavy workloads. Reads can be scaled by adding more secondaries, reducing the load on the primary. This enables it to more efficiently handle writes. In addition, if the primary becomes unavailable due to a transient failure, read requests directed to secondaries are not interrupted.

As there is a delay between when data is written to the primary and then successfully replicated to the secondaries, there is a chance that clients may read stale data from secondaries. Applications must therefore be aware of this possibility. In normal operations, the time between updating the primary and the secondaries should be small, for example, a few milliseconds. The smaller this time window, the less chance there is of a stale read.

Read replication and primary/secondary–based database architectures are topics I’ll return to in much more detail in this and the following chapters.

Scale Out: Partitioning Data

Splitting up, or partitioning, data in a relational database is a technique for distributing the database over multiple independent disk partitions and database engines. Precisely how partitioning is supported is highly product-specific. In general, there are two strategies: horizontal partitioning and vertical partitioning.

Horizontal partitioning splits a logical table into multiple physical partitions. Individual rows are allocated to a partition based on some partitioning strategy. Common partitioning strategies are to allocate rows to partitions based on some value in the row, or to use a hash function on the primary key. As shown in Figure 10-3, you can allocate a row to a partition based on the value of the region field in each row.

Figure 10-3. Horizontal database partitioning
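
As a minimal sketch in Java (the partition count and region names are illustrative assumptions), the two common allocation strategies described above look like this:

// hash partitioning: spread rows across partitions by hashing the primary key
static int partitionByHash(String primaryKey, int numPartitions) {
    // floorMod avoids negative partition numbers for negative hash codes
    return Math.floorMod(primaryKey.hashCode(), numPartitions);
}

// value-based partitioning: allocate each row by its region field
static int partitionByRegion(String region) {
    switch (region) {
        case "NorthAmerica": return 0;
        case "Europe":       return 1;
        default:             return 2; // all remaining regions
    }
}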

Vertical partitioning, also known as row splitting, partitions a table by the columns in a row. Like normalization, vertical partitioning splits a row into one or more parts, but for the reasons of physical rather than conceptual optimization. A common strategy is to partition a row between static, read-only data and dynamic data. Figure 10-4 shows a simple vertical partitioning for an inventory system that employs this scheme.

Figure 10-4. Vertical database partitioning

Relational database engines will have various levels of support for data partitioning. Some facilitate partitioning tables on disk. Others support partitioning data across nodes to scale horizontally in a distributed system.

Regardless, the very nature of relational schemas, with data split across multiple tables, makes it problematic to devise a general partitioning strategy for distribution. Horizontal partitions of data ideally distribute tables across multiple nodes. However, if a single request needs to access data from multiple nodes, or join data from distributed partitioned tables, a high level of network traffic and request coordination is required. This may not give the performance benefits you expect. These issues are briefly covered in the following sidebar.

Example: Oracle RAC

Despite the inherent problems of partitioning relational models and the complexities of SQL queries at scale, vendors have worked in the last two decades to scale out relational databases. One notable example is Oracle’s Real Applications Cluster (RAC) database.

Oracle’s RAC database was released in 2001 to provide a distributed version of the Oracle database engine for high-volume, highly available systems. Essentially, Oracle makes it possible to deploy a cluster of up to 100 Oracle database engines that all access the same physical database.

To avoid the data partitioning problem, Oracle RAC is an example of a shared-everything database. The clustered database engines access a single, shared data store of the data files, logs, and configuration files that comprise an Oracle database. To the database client, the clustered deployment is transparent and appears as a single database engine.

The physical storage needs to be accessible to all nodes using a network-accessible storage solution known as Storage Area Network (SAN). SANs provide high-speed network access to the Oracle database. SANs also must provide hardware-level disk mirroring to create multiple copies of application and system data in order to survive disk failure. Under high load, the SAN can potentially become a bottleneck. High-end SANs are extremely specialized storage devices that are expensive beasts to acquire.

Two proprietary software components are required for Oracle RAC deployments, namely:

Clusterware
Supports communications and coordination between the clustered database engines. It manages, for example, cluster node membership, node failover, and high availability.
Cache Fusion
Enables the individual caches in each clustered database node to be effectively shared so that accesses to the persistent store are minimized.

An overview of a RAC system is shown in Figure 10-5.

Figure 10-5. Oracle RAC overview

Oracle RAC illustrates one architectural approach, namely shared everything, to scaling a relational database. It adds processing capacity and high availability to an Oracle deployment while requiring (in theory anyway) no application code changes. The database requires multiple proprietary Oracle software components and expensive redundant storage and interconnect hardware. Add Oracle license costs, and you don’t have a low-cost solution by any means.

Many Oracle customers have adopted this technology in the last 20 years. It’s mature and proven but, viewed through the lens of today’s technology landscape, it is based on an architecture that offers limited on-demand scalability at high cost. The alternative, namely a shared-nothing architecture that exploits widely available, low-cost commodity compute nodes and storage, is the approach I’ll focus on going forward.

The Movement to NoSQL

I’m not brave enough to try and construct a coherent narrative describing the forces that brought about the creation of a new generation of NoSQL database technologies.2 My personal inclination is that this innovation was driven by a confluence of reasons that started to gather momentum in the early 2000s. In no particular order, some of these reasons were:

  • The development of powerful, low-cost, commodity hardware, including multicore CPUs, faster, larger disks, and increased network speeds.

  • The emergence of applications that dealt with unstructured data types and rapidly evolving business and data models. No longer was the “one size fits all” approach of relational adherents applicable to these new use cases.

  • Increased need for scalability and availability for internet-facing applications.

  • New opportunities to gather raw data and utilize this for new business insights and analytics.

Combined with the complexities of scaling relational databases for massive data sets that I’ve described in this chapter, the time was ripe for a new database paradigm. Much of the database and distributed systems theory that was needed for such innovation was known, and this created fertile ground for the emergence of a whole collection of new database platforms.

The NoSQL database ecosystem that blossomed to address the evolving business and technological landscape of the early 2000s is by no means a homogeneous place. Several different approaches emerged and were implemented to some extent in various (mostly open source) databases. In general, however, the core characteristics of the NoSQL movement are:

  • Simplified data models that can be easily evolved

  • Proprietary query languages with limited or no support for joins

  • Native support for horizontal scaling on low-cost, commodity hardware

I’ll look at each of these characteristics in turn in the following subsections. But before that, consider this: how do NoSQL databases survive without the capability to execute JOIN-like queries? The answer lies in how you model data with NoSQL.

NoSQL JOIN

For illustration, and at the time of writing, Couchbase, Oracle NoSQL, and MongoDB support some form of joins, often with limitations. Oracle NoSQL joins are limited to hierarchically related tables only. MongoDB’s $lookup operation allows only one of the collections to be partitioned. Cassandra, DynamoDB, Riak, and Redis have no support for join operations. Graph databases like Neo4j and OrientDB use graph traversal algorithms and operations and hence have no need for joins.

Data model normalization, as encouraged by relational databases, provides a proven technique for modeling the problem domain. It creates models with a single entry for every data item, which can be referenced when needed. Updates just need to modify the canonical data reference, and the update is then available to all queries that reference the data. Due to the power of SQL and joins, you don’t have to think too hard about all the weird and wonderful ways the data will be accessed, both immediately and in the future. Your normalized model should (in theory) support any reasonable query for the application domain, and SQL is there to make it possible.

With NoSQL, the emphasis changes from problem domain modeling to modeling the solution domain. Solution domain modeling requires you to think about the common data access patterns the application must support, and to devise a data model that supports these accesses. For reading data, this means your data model must prejoin the data you need to service a request. Essentially, you produce what relational modelers deem a denormalized data model. You are trading off flexibility for efficiency.

Another way of thinking about solution domain modeling is to create a table per use case. As an example, skiers and snowboarders love to use their apps to list how many days they have visited their favorite mountains each season, how many lifts they rode, and what the weather was like. Using normalization, you’d probably produce something like the following as a logical data model and create tables that implement the model:

SnowSportPerson = {ssp_id, ssp_name, address, dob, ……….}
Resort = {resort_id, resort_name, location, …..}
Visit = {ssp_id, resort_id, date, numLifts, vertical, …..}
Weather = {resort_id, date, maxtemp, mintemp, wind, …}

Using SQL, it’s straightforward JOIN wizardry to generate a list of visits for a specific person that looks like the following:

  • Summary: Ian Gorton, number of days: 2

Date         Resort            Number of lifts  Total vertical feet  Max/min temp (F)  Wind (mph)
Dec 2, 2021  49 Degrees North  17               27,200               27/19             11
Dec 9        Silver Mountain   14               22,007               32/16             3

In NoSQL data modeling, you create a data model that has the results the query needs all together in a table. As shown in the following, a VisitDay has all the data items needed to generate each line in the list above. You just have to sum the number of VisitDay objects in the results set to calculate the number of days for a single person.3

VisitDay = {date, resort_name, ssp_id, ssp_name, numLifts, vertical, maxtemp, mintemp, wind}

The SnowSportPerson, Resort, and Weather tables would remain unchanged from your original model. This means you have duplicated data across your logical tables. In this example, most of the data in these tables is write-once and never changes (e.g., weather conditions for a particular day), so duplication just uses more disk space—not a major problem in modern systems.

Imagine, though, if a resort name changes. It does actually happen occasionally. This update would have to retrieve all VisitDay entries for that resort and update the resort name in every entry. In a very large database, this update might take a few tens of seconds or more, but as it’s a data maintenance operation, it can be run one dark night so that the new name appears magically to users the next day.
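
A bulk update like this is typically one statement in a document database. Here is a hedged PyMongo sketch, reusing the hypothetical visit_days collection from the earlier example (the resort names are purely illustrative):

result = visit_days.update_many(
    {"resort_name": "Squaw Valley"},              # every VisitDay for the resort
    {"$set": {"resort_name": "Palisades Tahoe"}}  # rewrite the duplicated name
)
print(f"Rewrote {result.modified_count} VisitDay entries")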

So there you have it. If you design your data model to efficiently process requests based on major use cases, complex operations like joins are unnecessary. Add to this that it becomes easier to partition and distribute data and the benefits start to stack up at scale. The trade-offs are that, typically, reads are faster and writes are slower. You also have to think carefully about how to implement updates to duplicate data and maintain data integrity.

NoSQL Data Models

As illustrated in Figure 10-6, there are four main NoSQL data models, all of which are somewhat simpler than the relational model.

Figure 10-6. NoSQL data models

There are inevitably subtle overlaps between these models. But ignoring these subtleties, the four are:

Key-value
Key-value (KV) databases are basically a hash map. Every object in the database has a unique key that is used to retrieve the data associated with that key. This data is typically opaque to the database engine: it can be a string, JSON, an image, or whatever else the business problem demands. Examples of KV databases include Redis and Oracle NoSQL.
Document
A document database builds on the KV model, again with each document in the database requiring a unique key. The value associated with the key is not opaque to the database. Rather it is encoded, typically in JSON, making it possible to reference individual elements in a document in queries and for the database to build indexes on document fields. Documents are usually organized into logical collections analogous to relational tables, but there is no requirement for all documents in the collection to have the same format. Leading document databases are MongoDB and Couchbase.
Wide column
A wide column database extends the KV model by organizing data associated with a key in named columns. It’s essentially a two-dimensional hash map, enabling columns within a row to be uniquely identified and sorted using the column name. Like a document database, each row in a collection can have different columns. Apache Cassandra and Google Bigtable are examples of wide column databases.
Graph
Graphs are well-understood data structures for storing and querying highly connected data. Think of your friends on Facebook, or the routes flown by an airline between airports. Graph databases treat relationships between database objects as first-class citizens, and hence enable a wide range of graph-based algorithms to be implemented efficiently. Conceptually the closest to relational databases, prominent examples are Neo4j and Amazon Neptune.

Regardless of data model, NoSQL databases are usually termed schemaless databases. Unlike relational databases, the format of every object you write into the database does not have to be defined up front. This makes it easy to evolve data object formats, as there is no need for every object in a logical collection to have the same format.

The inevitable trade-off for this flexibility is that it becomes the responsibility of the application to discover the structure of the data it reads. This requires data objects to be stored in the database along with metadata (basically field names) that make structure discovery possible. You’ll often see these two approaches called schema-on-write (defined schema) and schema-on-read (schemaless).
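
The following Python sketch illustrates schema-on-read: each stored object carries its own field names as metadata, and the application discovers the structure as it reads. The records and field names are hypothetical:

import json

raw_records = [
    '{"ssp_id": 1, "ssp_name": "Ian", "dob": "1990-01-01"}',
    '{"ssp_id": 2, "ssp_name": "Anna", "email": "anna@example.com"}',  # newer format
]

for raw in raw_records:
    record = json.loads(raw)               # field names travel with the data
    fields = sorted(record)                # structure discovered at read time
    email = record.get("email", "<none>")  # field absent from older objects
    print(record["ssp_id"], fields, email)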

Query Languages

NoSQL database query languages are nearly always proprietary to a specific database, and vary between explicit API-based capabilities and SQL-like declarative languages. Client libraries in various languages, implemented by the vendor as well as by third parties, are available for use in applications. For example, MongoDB officially supports twelve client libraries for different languages and has third-party offerings for many more.

KV databases may offer little more than APIs that support CRUD operations based on individual key values. Document databases normally support indexing of individual document fields. This enables efficient implementations of queries that retrieve result sets and apply updates to documents that satisfy various search criteria. For example, the following is a MongoDB query that retrieves all the documents from the skiers database collection for individuals older than 16 who have not renewed their ski pass:

db.skiers.find( {
   age: { $gt: 16 },
   renew: { $exists: false }
} )

Wide column databases have a variety of query capabilities. HBase supports a Java CRUD API with the ability to retrieve result sets using filters. Cassandra Query Language (CQL) is modeled on SQL and provides a declarative language for accessing the underlying wide column store. If you know SQL, CQL will look very familiar. CQL by no means implements the full SQL feature set, however. For example, the CQL SELECT statement can only apply to a single table and doesn't support joins or subqueries.
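
As a rough illustration, the following sketch uses the DataStax Python driver to run a CQL query against a hypothetical visit_day table whose partition key is ssp_id; the keyspace and schema are assumptions, not code from the book:

from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("ski")  # hypothetical keyspace

# CQL reads one table at a time and filters on the partition key; with no
# joins or subqueries, the data must be prejoined at write time
rows = session.execute(
    "SELECT date, resort_name, num_lifts, vertical "
    "FROM visit_day WHERE ssp_id = %s",
    (12345,),
)
for row in rows:
    print(row.date, row.resort_name, row.num_lifts, row.vertical)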

Graph databases support much richer query capabilities. OrientDB uses SQL as the basic query language and implements extensions to support graph queries. Another example is Cypher, originally designed for the Neo4j graph database and open sourced through the openCypher project. Cypher provides capabilities to match patterns of nodes and relationships in the graph, with powerful query and insert statements analogous to SQL. The following example returns the emails of everyone who has a VISITED relationship to the ski resort node with a name property of Mission Ridge:

MATCH (p:Person)-[rel:VISITED]->(c:Skiresort)
WHERE c.name = 'Mission Ridge'
RETURN p.email

Data Distribution

NoSQL databases are in general designed to natively scale horizontally across distributed compute nodes equipped with local storage. This is a shared nothing architecture, as opposed to the shared everything approach I described with Oracle RAC. With no shared state, bottlenecks and single points of failure are eliminated,5 and performance, scalability, and availability enhanced. There’s one notable exception to this rule, and that is graph databases, as I describe in the following sidebar.

Partitioning, commonly known as sharding, requires an algorithm to distribute the data objects in a logical database collection across multiple server nodes. Ideally, a sharding algorithm should evenly distribute data across the available resources. For example, if you have one hundred million objects and ten identical database servers, each shard will have ten million objects resident locally.

Sharding requires a shard or partition key that is used to allocate a given data object to a specific partition. When a new object is created, the shard key maps the object to a specific partition that resides on a server. When a query needs to access an object, it supplies the shard key so the database engine can locate the object on the server where it resides. This is illustrated in Figure 10-7.

Figure 10-7. Data partitioning

Three main techniques exist for sharding, and all distributed databases will implement one or more of these approaches; a short sketch after the list illustrates all three:

Hash key
The partition for any given data object is chosen by applying a hash function to the shard key, and the resulting hash is then mapped to a partition. There are two main ways of doing this: a modulus approach, or an algorithm known as consistent hashing.
Value-based
The partition is chosen based on the value of the shard key. For example, you might want to partition customer data based on country of residence. Choosing the country field as the shard key ensures that all data objects for customers who live in China reside in the same partition, all customers in Finland are allocated to the same partition, and so on.
Range-based
Partitions host data objects whose shard key falls within a specific range of values. For example, you might use zip code/post code ranges to allocate all customer objects that reside in the same geographical area to the same partition.
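
Here is a minimal Python sketch of all three strategies, assuming ten partitions and hypothetical key values; production databases use more robust mappings, such as consistent hashing, but the shapes are the same:

import hashlib

N_PARTITIONS = 10

def hash_shard(shard_key: str) -> int:
    # Hash key: hash the shard key, then map the result with a modulus
    digest = hashlib.md5(shard_key.encode()).digest()
    return int.from_bytes(digest, "big") % N_PARTITIONS

def value_shard(country: str) -> int:
    # Value-based: all objects with the same shard key value land together
    partition_for_country = {"CN": 0, "FI": 1, "US": 2}  # hypothetical mapping
    return partition_for_country[country]

def range_shard(zip_code: int) -> int:
    # Range-based: contiguous shard key ranges map to the same partition
    ranges = [(0, 29999, 0), (30000, 69999, 1), (70000, 99999, 2)]
    for low, high, partition in ranges:
        if low <= zip_code <= high:
            return partition
    raise ValueError("zip code out of range")

print(hash_shard("ssp-12345"), value_shard("FI"), range_shard(83340))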

Partitioning makes it possible to scale out a database by adding processing and disk capacity and distributing data across these additional resources. However, if one of the partitions is unavailable due to a network error or disk crash, then a chunk of the database cannot be accessed.

Solving this availability problem requires the introduction of replication. The data objects in each partition are typically replicated to two or more nodes. If one node becomes unavailable, the application can continue to execute by accessing one of the replicas. This partitioned, replicated architecture is shown in Figure 10-8. Each partition has three replicas, with each replica hosted on a different node.

Figure 10-8. Data partitioning and replication, with three replicas per partition

Replication enhances both availability and scalability. The additional resources that store replicas can be used to handle both read and write requests from applications.

There is, however, as always with distributed systems, a complication to address. When a data update request occurs, the database needs to update all replicas. This ensures the replicas are consistent and all clients will read the same value regardless of the replica they access.

There are two basic architectures for managing distributed database replication. These are:

Leader-follower
One replica is designated the leader and it always holds the latest value of any data object. All writes are directed to the leader, which is responsible for propagating updates to the replicas. The followers are read-only replicas. Application reads can be load balanced across the followers to scale out read performance.
Leaderless
Any replica can handle both reads and updates. When an update is sent to a replica, it becomes the request coordinator for that update and is responsible for ensuring the other replicas get correctly updated. As writes can be handled by any replica, the leaderless approach tends to be more scalable for write-heavy applications.

Replica consistency turns out to be a thorny distributed systems issue. The core of the problem revolves around how and when updates are propagated to replicas to ensure they have the same values. The usual issues of varying latencies and network and hardware failures make this totally nontrivial.

If a database can ensure all replicas always have the same value, then it is said to provide strong consistency, as all client accesses will return the same value for every data object. This implies the client must wait until all replicas are modified before an update is acknowledged as successful.

In contrast, a client may only want to wait for one replica to be updated, and trust the database to update the others as soon as it can. This means you have a window of time when replicas are inconsistent and reads may or may not return the latest value. Databases that allow replica inconsistency are known as eventually consistent. The trade-offs between strong and eventual consistency and how design choices affect scalability and availability are dealt with in detail in the next three chapters.

The CAP Theorem

Eric Brewer’s famous CAP theorem6 elegantly encapsulates the options you have for replica consistency and availability when utilizing distributed databases. It describes the choices a database system has if there is a network partition, namely when the network drops or delays messages sent between the nodes in the database.

Basically, if the network is operating correctly, a system can be both consistent and available. If a network partition occurs, a system can be either consistent (CP) or available (AP).

This situation arises because a network partition means some nodes in the database are not accessible to others—the partition splits the database into two groups of nodes. If an update occurs and the replicas for the updated data object reside on both sides of the partition, then the database can either:

  • Return an error as it cannot ensure replica consistency (CP).

  • Apply the update to the subset of replicas that are visible (AP). This means there is replica inconsistency until the partition heals and the database can make all replicas consistent. Until the inconsistency is resolved, clients may see different values for the same data object.

You’ll see the AP or CP categorization used for different NoSQL databases. It’s useful but not totally meaningful as most databases, as I’ll explain in Chapter 13, make it possible to tune configuration parameters to achieve AP or CP to meet application requirements.

Summary and Further Reading

As the scale of systems has grown, a revolution has taken place in the database realm. Databases must store massive volumes of data, provide rapid query response times for globally distributed clients, and be available 24/7. This has required database technologies to become distributed and to adopt new data models that are more amenable to the unstructured, ever-changing data types required by modern applications.

In this chapter, I’ve explained why relational databases and SQL can become problematic at scale. In contrast, NoSQL databases adopt simple data models that can be replicated and partitioned to support massive data sets and request volumes. As always, there are trade-offs. NoSQL databases do not support the rich query features of SQL, placing a greater burden on the application. Distributed database designers also need to be aware of the consistency and availability trade-offs that are enumerated by the CAP theorem.

With these foundations, the following three chapters focus on the complexities of the trade-offs implied by the CAP theorem. I'll explain the approaches that have been devised and implemented in various databases to enable applications to balance consistency, availability, and performance to meet their requirements.

For an excellent introduction to NoSQL databases, it’s still hard to beat NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod Sadalage and Martin Fowler (Addison-Wesley Professional, 2013). For broader coverage of the database landscape, including both SQL and NoSQL, SQL and NoSQL Databases: Models, Languages, Consistency Options and Architectures for Big Data Management by Andreas Meier and Michael Kaufmann (Springer, 2019) is well worth a read. Finally, if this chapter has whetted your appetite for learning about how databases work in depth, Alex Petrov’s Database Internals (O’Reilly, 2019) is highly recommended.

1 The primary can typically be configured to handle reads as well as writes. This is all highly application-dependent.

2 NoSQL probably stands for Not Only SQL, but this is somewhat vacuous. It’s best to regard NoSQL as a simple label rather than an acronym.

3 Most NoSQL databases support embedded or nested data objects. This makes it possible to create a single database object for a person’s resort visits and update this object every time a new visit occurs. This simplifies reads as a query just retrieves one object that contains all the visit data needed. Depending on the database, updates may not be as efficient as inserts. This is a very database-specific issue.

4 Chris Date, Database Design and Relational Theory: Normal Forms and All That Jazz, 2nd ed. (Apress, 2012).

5 Again, this is in fact database implementation dependent. Shared-nothing architecture theoretically removes single points of failure and bottlenecks, but some implementations add them back!

6 Eric Brewer, “CAP Twelve Years Later: How the ‘Rules’ Have Changed,” Computer, Volume 45, Issue 2 (2012), 23–29.

7 The story of the move to MyRocks and MySQL version 8.0 is well worth a few minutes’ reading.

8 MongoDB’s usage at Baidu is briefly described here, and links to an excellent presentation with more details.

Chapter 11. Eventual Consistency

Eventual consistency has risen in prominence with the emergence of distributed NoSQL databases. It remains a concept that is heretical to some who were raised in the era of transactions and relational databases. In some application domains, with banking and finance usually cited, eventual consistency simply isn't appropriate. So goes the argument, anyway.

In fact, eventual consistency has been used in the banking industry for many years. Anyone remember writing checks? Checks take days to be reconciled on your account, and you can easily write checks for more money than you have in your account. When the checks get processed, and consistency is established, you might see some consequences, however.

It is similar with ATM transactions. If an ATM is partitioned from the network and cannot check your balance, you will still usually be able to get cash, albeit limited to a small amount. At this stage your account balance is inconsistent. When the partition heals, the ATM will send the transactions to be processed by the backend systems and the correct value for your account will be calculated.

In the era of scalable internet systems, eventual consistency has found many suitable use cases. In this chapter, I’ll delve into the major issues that you need to be aware of when building eventually consistent systems with distributed databases at scale.

What Is Eventual Consistency?

In the good old days, when systems had a single source of truth for all data items—the database—replica consistency was not a problem. There simply were no replicas. But as I explained in Chapter 10, many systems need to scale out their databases across multiple nodes to provide the necessary processing and storage capacity. In addition, to ensure the data for each node is highly available, you also need to replicate the contents of each node to eliminate single points of failure.

Suddenly your database has become a distributed system. When the database nodes and networks are fast and working reliably, your users have no idea they are interacting with a distributed system. Replicas are updated seemingly instantaneously, and user requests are processed with low response times. Inconsistent reads are rare.

But as you know by now, distributed systems need to be able to handle various failure modes. This means the database has to deal with all the issues inherent with highly variable network latencies, and communication and machine failures. These failures mean your database replicas may remain inconsistent for longer periods than your application may wish to tolerate. This creates issues you need to understand and be able to address.

Inconsistency Window

The inconsistency window in an eventually consistent system is the duration it takes for an update to a data object to propagate to all replicas. In a leader-based system, the leader coordinates the updating of other replicas. In a leaderless system, any replica (or potentially any database node—this is implementation dependent) coordinates the update. The inconsistency window ends when all replicas have the same value.

Several factors affect the duration of the inconsistency window. These are outlined in the following:

The number of replicas
The more replicas you have, the more replica updates need to be coordinated. The inconsistency window only closes when all replicas are identical. If you have three replicas, then only three updates are needed. But the more replicas you have, the greater the chance that one of them responds slowly and elongates the inconsistency window.
Operational environment
Any instantaneous operational glitch, such as a transient network failure or lost packets, can extend the inconsistency window. Probably the main cause of replica update delays is a heavy read/write workload at a node. This overloads replicas and introduces additional data propagation latency. Hence the more load your database is experiencing, the longer the inconsistency window is likely to be.
Distance between replicas
If all replicas are on the same local area network subnet, communications latencies can be submillisecond. If one of your replicas is across the continent or across the world, the minimum value of the inconsistency window will be the round-trip time between replicas. With geographical distribution, this could be relatively large, several tens of milliseconds in fact.1 It all depends on the distance, as I explained in Chapter 3.

All these issues mean that you don’t have control over the duration of the inconsistency window. You can’t provide or know an upper bound. With eventually consistent systems that communicate state changes asynchronously, this is a fact of life you have to live with.

Read Your Own Writes

Not too long ago, while booking a flight, I had to update my credit card information as a new one had been issued due to a hack at a major store. I duly added my new card information, saved it, and continued the checkout process to pay for my flight. To my surprise, the payment was rejected because I hadn’t updated my credit card information. Wait a minute, I thought, and checked my profile. The new card details were in my profile marked as the default card. So, I tried the transaction again, and everything worked fine.

I don't know exactly how this system was implemented, but I'm betting it uses an eventually consistent database and does not support read your own writes (RYOWs). RYOWs is a property of a system that ensures that if a client makes a persistent change to data, the updated value is guaranteed to be returned by any subsequent reads from the same client.

In an eventually consistent system, the inconsistency window makes it possible for a client to:

  • Issue an update to a database object key.

  • Issue a subsequent read for the same database object key and see the old value as it accesses a replica that has not yet persisted the prior update.

This is illustrated in Figure 11-1. The client request to update their credit card details is coordinated by Replica 1, which sends the new card details asynchronously to the other replicas. The update to Replica 3 incurs a delay, however. Before the update is applied, the same client issues a read that is directed to Replica 3. The result is a stale read.

Figure 11-1. Eventual consistency leading to a stale read

To avoid this situation, a system needs to provide RYOWs consistency.2 This guarantees, for an individual user, that any updates made by the user will be visible in subsequent reads. The guarantee doesn’t hold for other users. If I add a comment to an online article, when I reload the page, I will see my comment. Other users who load the page at the same time may or may not see my comments immediately. They will see it eventually.

With leader-follower replication, implementing read your writes consistency is straightforward. For use cases that require RYOWs, you simply ensure the subsequent read is handled by the leader replica. This is guaranteed to hold the latest data object value.
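
As a toy sketch of the idea (not any particular database's API), the following routes reads of a session's own writes to the leader, while other reads can be load balanced across followers that may lag behind:

import random

class Session:
    # Tracks the keys this client has written in the current session
    def __init__(self):
        self.wrote = set()

class LeaderFollowerStore:
    def __init__(self, n_followers=2):
        self.leader = {}
        self.followers = [dict() for _ in range(n_followers)]  # updated asynchronously

    def write(self, session, key, value):
        self.leader[key] = value  # all writes go to the leader
        session.wrote.add(key)    # remember the write for RYOWs routing

    def read(self, session, key):
        if key in session.wrote:
            return self.leader[key]  # RYOWs: read own writes from the leader
        return random.choice(self.followers).get(key)  # may return stale data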

The implementation of RYOWs, if supported, varies by database platform. With MongoDB, a database I’ll describe in more detail in Chapter 13, this is the default behavior achieved by accessing the master replica.3 In Neo4j clusters, all writes are handled by the leader, which asynchronously updates read-only followers. Reads, however, may be handled by replicas. To implement RYOWs consistency, any write transaction can request that a bookmark is returned that uniquely identifies that update. A subsequent read request passes the bookmark to Neo4j, enabling the cluster to ensure that only replicas that have received the bookmarked transaction process the read.

Tunable Consistency

Many eventually consistent databases provide configuration options and API parameters to enable you to tailor the database’s eventually consistent behavior. This makes it possible to trade off the performance of individual read and write operations based on the level of eventual replica consistency a use case can tolerate. The basic approach is known as tunable consistency.

Tunable consistency is based on specifying the number of replicas that a request must access to complete a database request. To explain how this works, let’s define the following:

N
Total number of replicas
W
Number of replicas to update before confirming the update to the client
R
Number of replicas to read from before returning a value

As an example, assume N = 3, and there is a leaderless database in which any individual request can be handled by any one of the replicas. The replica handling the request is called the coordinator. You can tune write operation performance and the extent of the inconsistency window by specifying the W value as shown in the following examples:

W = 3
The request coordinator will wait until all three replicas are updated before returning success to the client.
W = 1
The request coordinator will confirm the update locally and return success to the client. The other two replicas will be updated asynchronously.

This means if W = 3, all replicas will be consistent after the write completes. This is sometimes called immediate consistency. In this case, clients can issue reads with a value of R = 1 (or quorum—see next section) and they should receive the latest value, as long as reads are not concurrent with the replica updates. Reads that occur while the replicas are being updated may still see different values depending on the replicas they access. Only once the replica values have converged will all reads see the same value. Hence immediate consistency is not the same as strong consistency (see Chapter 12) as stale reads are still possible.4

If W = 1, then you have an inconsistency window as only one replica, the request coordinator in our example, is guaranteed to have the latest value. If you issue a read with R = 1, the result may or may not be the latest value.

Remember the CAP theorem from Chapter 10? There are some consistency-availability trade-offs to consider here. If we set W = N, then there are two consequences:

  • All replicas are consistent. This option favors replica consistency. Note that writes will be slower. The client must wait for updates to be acknowledged by all replicas, and this will add latency to writes, especially if one replica is slow to respond.

  • Writes may fail if a replica is not accessible. This would make it impossible for the request coordinator to update all replicas, and hence the request will throw an exception. This negatively affects availability (see discussion of hinted handoffs later in this chapter).

This option is CP in CAP terminology.

Alternatively, if we set W = 1, writes succeed if any replica is available. There will be an inconsistency window that will last until all replicas are updated. The write will succeed even if one or more replicas are partitioned or have failed. This option therefore favors availability over replica consistency, or AP in CAP parlance.

To combat this inconsistency window, a client can specify how many replicas should be read before a result is returned. If we set R = N, then the request coordinator will read from all replicas, determine which is the latest update, and return that value to the client (I’ll return to precisely how the coordinator determines which replica holds the latest value later in this chapter. For now just assume it is possible). The result is that by reading from all replicas, you are guaranteed to access the one that holds the latest updated value.

Another way to look at the trade-offs involved is read optimized versus write optimized. The (W = N, R = 1) setting favors both consistency and read latencies, as only one replica needs to be accessed. The trade-off is longer write times. The (W = 1, R = N) option favors both availability and write latencies, as writes succeed after any replica is updated. The trade-off is slower reads.
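
As a rough illustration of how these settings interact, the following Python sketch checks whether an (N, W, R) configuration guarantees that a read overlaps the most recent acknowledged write; the configuration names are illustrative only:

def read_overlaps_write(n: int, w: int, r: int) -> bool:
    # When R + W > N, every read set must intersect every write set,
    # so at least one replica read holds the latest acknowledged write
    return r + w > n

configs = {
    "read-optimized":  (3, 3, 1),  # W = N, R = 1: slow writes, fast reads
    "write-optimized": (3, 1, 3),  # W = 1, R = N: fast writes, slow reads
    "quorum":          (3, 2, 2),  # W = R = (N // 2) + 1: balanced
    "fast-and-loose":  (3, 1, 1),  # fast everything, stale reads possible
}
for name, (n, w, r) in configs.items():
    print(f"{name}: overlap guaranteed = {read_overlaps_write(n, w, r)}")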

These settings enable you to tune individual database requests to match your requirements. If inconsistent reads are not desirable, choose either W = N and R = 1, which will add latency to writes but make reads as fast as possible, or W = 1 and R = N, to optimize writes at the expense of reads. If your use cases can tolerate inconsistency, set W = R = 1 and benefit from fast reads and writes. Or, if you want to balance performance and consistency, there’s another option, as I’ll explain in the next section.

Quorum Reads and Writes

There’s an option that lies between the alternatives discussed in the previous section. These are known as quorum reads and writes. Quorum simply means the majority, which is (N / 2) + 1.5 For our three replicas, the majority is two. For five replicas, the majority is three, and so on.

If we configure both the W and R value to be the quorum, we can balance the performance of reads and writes and still provide access to the latest updated value of a data object. Figure 11-2 illustrates how quorums work. With three replicas, a quorum means a write must succeed at two replicas, and a read must access two replicas. Initially all three replicas have a data object K with value v1, and the following sequence of actions takes place:

  1. Client 1 updates the object to hold value v2 and the write is acknowledged as successful once a quorum—in this case Replica 1 and Replica 2—are updated.

  2. The command to update Replica 3 is delayed (slow network? busy node?).

  3. Client 2 issues a read on object K.

  4. Replica 2 acts as the request coordinator and sends a read request to the other two replicas for their value for K. Replica 3 is first to respond with K = v1.

  5. Replica 2 compares its value for K with that returned from Replica 3 and determines that v2 is the most recently updated value. It returns value v2 to Client 2.

The basic intuition of quorums is that by always reading and writing from the majority of replicas, read requests will see the latest version of a database object. This is because the majority that is written to and the majority that are read from must overlap. In Figure 11-2, even though Replica 3 is not updated before the read takes place, the read accesses Replica 2, which does hold the updated value. The request coordinator can then ensure the latest value is returned to the client.

Figure 11-2. Quorum reads and writes

So, what’s the inevitable trade-off here? Simply, writes and reads will fail if a quorum of nodes is not available. A network failure that partitions a group of replicas such that the partition visible to a client does not contain a quorum will cause that client’s requests to fail.

In some database systems designed to favor availability over consistency, the concept of a sloppy quorum is supported. Sloppy quorums were first described in Amazon’s original Dynamo paper,6 and are implemented in several databases including DynamoDB, Cassandra, Riak, and Voldemort.

The idea is simple. If a given write cannot achieve quorum due to the unavailability of replica nodes, the update can be stored temporarily on another reachable node. When the home node(s) for the replica(s) become available again, the node storing the update performs what is called a hinted handoff. A hinted handoff sends the latest value of the replica from its temporary location to the home node.

This scheme is depicted in Figure 11-3. The client sends an update to Replica 1. Replica 1 attempts to update Replica 2 and Replica 3, but Replica 3 is unavailable due to a transient network partition. Replica 1 therefore sends the update to another database node, Node N, which temporarily stores the update. Sometime later, Node N sends the update to Replica 3, and the value for the updated object becomes consistent across all replicas.

Figure 11-3. Sloppy quorums and hinted handoff
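
The following is a minimal sketch of the mechanism, not any particular database's implementation; replicas are modeled as dictionaries and the hint store is a simple in-memory buffer:

import time
from collections import defaultdict

hints = defaultdict(list)  # home replica -> updates awaiting handoff

def write(key, value, replicas, reachable):
    # Write to every reachable replica; buffer a hint for unreachable ones
    for replica in replicas:
        if replica in reachable:
            reachable[replica][key] = value
        else:
            hints[replica].append((key, value, time.time()))

def hinted_handoff(replica, reachable):
    # When the home node rejoins, replay its buffered updates
    for key, value, _ts in hints.pop(replica, []):
        reachable[replica][key] = value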

Sloppy quorums have two main implications. First, a write that has achieved a sloppy quorum guarantees durability on W nodes, but the W nodes are not all nodes that hold replica values of the updated data object. This means a client may still read a stale value, even with quorums configured (i.e., R + W > N), as it may access R nodes that have not been updated by the previous write operation.

Second, sloppy quorums increase write availability for a system. The trade-off is the potential for stale reads until the hinted handoff has occurred. Databases that support these features typically allow the system designer to turn these capabilities on or off to suit application needs.

Replica Repair

In a distributed, replicated database, you expect every replica to be consistent. Replication may take a while, but consistency is always the ultimate outcome. Unfortunately, in operational databases, replica drift occurs. Network failures, node stalls, disk crashes, or (heaven forbid!) a bug in the database code can cause replicas to become inconsistent over time.

A term from thermodynamics, entropy, is used to describe this situation. Basically, systems tend toward entropy (disorder) over time. Because of entropy, databases need to take active measures to ensure replicas remain consistent. These measures are known collectively as anti-entropy repair.

There are basically two strategies for anti-entropy repair. One is an active strategy that is applied when objects are accessed. This works effectively for database objects that are read reasonably frequently. For infrequently accessed objects, most likely the vast majority of your data, a passive repair strategy is used. This runs in the background and searches for inconsistent replicas to fix.

Active Repair

Also known as read repair, active replica repair takes place in response to database read requests. When a read arrives at a coordinator node, it requests the latest value for each replica. If any of the values are inconsistent, the coordinator sends back the latest value to update the stale replicas. This can be done in a blocking or nonblocking mode. Blocking waits for the replicas to confirm updates before responding to the client, whereas nonblocking returns the latest value to the client immediately and updates stale replicas asynchronously.

Precisely how read repair works is implementation dependent. Factors to consider are how many replicas are accessed on each read—perhaps all of them, a quorum, or a specific R value—and how replica divergence is detected and fixed. For detection, instead of requesting and comparing a complete, potentially large object with a complex structure, a hash value of the object can be used. If replica hashes match, then there is no need to perform a repair operation. Reading hashes, known as digest reads, reduces network traffic and latency. You'll find digest read implementations in several NoSQL databases, for example, ScyllaDB and Cassandra.
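
The following sketch shows digest-based read repair, assuming every stored object carries a version field that identifies the latest update (real systems use timestamps or version vectors for this):

import hashlib
import json

def digest(obj) -> str:
    # A digest read returns this hash instead of the full object
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def read_with_repair(replicas, key):
    values = [replica[key] for replica in replicas]
    if len({digest(v) for v in values}) == 1:
        return values[0]  # digests agree: no repair needed
    latest = max(values, key=lambda v: v["version"])  # assumes a version field
    for replica in replicas:
        if replica[key] != latest:
            replica[key] = latest  # push the latest value to stale replicas
    return latest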

Passive Repair

Passive anti-entropy repair is a process that typically runs periodically and is targeted at fixing replicas that are infrequently accessed. Essentially, the approach builds a hash value that represents each replicated collection of objects and compares the hashes of each collection. If the hashes match, no repair is needed. If they don’t, you know some replicas in the collection are inconsistent and further action is needed.

To create an efficient hash representation of a potentially very large collection of data objects, a data structure called a Merkle tree7 is typically utilized. A Merkle tree is a binary hash tree whose leaf nodes are hashes of individual data objects. Each parent node in the tree stores a hash of its pair of children nodes, such that the root node hash provides a compact representation of the entire data collection. Figure 11-4 shows a representation of a simple Merkle tree.

Figure 11-4. Example Merkle tree

Once a Merkle tree for a collection of objects has been constructed, it can be efficiently utilized to compare Merkle trees for each replica collection. Two nodes can exchange the root node hash, and if the root node values are equal, then the objects stored in the partitions are consistent. If they are not, the two child nodes of the root must be compared. One (or maybe both) of the child node hashes must be different as the root node hashes were different. The traversal and data exchange algorithm basically continues down the tree, following branches where hashes are not equal between replica trees, until leaf nodes are identified. Once identified, the stale data objects can be updated on the appropriate replica node.
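
The following compact Python sketch builds and compares Merkle trees, assuming both replicas hold the same number of objects in the same order; levels[0] holds the leaf hashes and levels[-1] the root:

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(leaves):
    # Build the binary hash tree bottom-up
    level = [h(leaf) for leaf in leaves]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2:
            level.append(level[-1])  # pad odd levels by duplicating the tail
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def stale_leaves(a, b):
    # Walk both trees from the root, descending only where hashes differ
    stale, stack = [], [(len(a) - 1, 0)]
    while stack:
        level, i = stack.pop()
        if a[level][i] == b[level][i]:
            continue  # identical subtree: prune it
        if level == 0:
            stale.append(i)  # a leaf whose replicas diverge
        else:
            stack += [(level - 1, 2 * i), (level - 1, 2 * i + 1)]
    return stale

replica_a = merkle_levels([b"v1", b"v2", b"v3", b"v4"])
replica_b = merkle_levels([b"v1", b"v2", b"v3-stale", b"v4"])
print(stale_leaves(replica_a, replica_b))  # -> [2]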

Merkle tree construction is a CPU- and memory-intensive operation. For these reasons, the process is either initiated on demand, initiated by an administration tool, or scheduled periodically. This enables anti-entropy repair to occur when the database is experiencing a low request load, and hence doesn’t cause increased latencies on database accesses during production. Examples of NoSQL databases that implement anti-entropy repair are Riak and Cassandra.

Handling Conflicts

Up until now in this chapter, I’ve assumed that a database has some mechanism to discern the latest value for any given replicated database object. For example, when reading from three replicas, the database will somehow be able to decide which replica is the most recently updated and return that value as the query result.

In a leaderless system, writes can be handled by any replica. This makes it possible for two clients to concurrently apply independent updates to the same database key on different replicas. When this occurs, in what order should the updates be applied? What should be the final value that all replicas hold? You need some mechanism to make this decision possible.

Last Writer Wins

One way to decide final, definitive values is to use timestamps. A timestamp is generated for the update request and the database ensures that when concurrent writes occur, the update with the most recent timestamp becomes the final version. This is simple and fast from the database perspective.

Unfortunately, there’s a problem with this approach. In what order did the updates really happen? As I described in Chapter 3, clocks on machines drift. This means one node’s clock may be ahead of others, making comparing timestamps meaningless. In reality, we can’t determine the order of the events. They are executed on different replicas of the same data object by two or more independent processes. These updates must be considered as simultaneous, or concurrent. The timestamps attached to the updates simply impose an arbitrary order on the updates for conflict resolution.

The consequence of this is when concurrent updates occur using last writer wins, updates will be silently discarded. Figure 11-5 depicts one scenario where updates are lost using a shared playlist as an example. Client 1 writes the first entry to the playlist, and this entry is subsequently read at some time later by both Client 1 and Client 2. Both clients then write a new entry to the playlist, but as Client 2’s update is timestamped later than Client 1’s, the updates made by Client 1 are lost.

Figure 11-5. Concurrent writes causing lost updates with last writer wins
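
The following minimal last-writer-wins register illustrates the silent discard; timestamps are supplied explicitly to mimic two replicas applying the same pair of concurrent writes in different orders:

import time

class LWWRegister:
    # Highest timestamp wins; everything else is silently dropped
    def __init__(self):
        self.value, self.ts = None, 0.0

    def write(self, value, ts=None):
        ts = time.time() if ts is None else ts
        if ts > self.ts:  # later timestamp replaces the value
            self.value, self.ts = value, ts
        # else: the concurrent write is silently discarded

# Two replicas see the same concurrent updates in different orders...
a, b = LWWRegister(), LWWRegister()
a.write("song-1", ts=100.0); a.write("song-2", ts=99.0)   # song-2 discarded
b.write("song-2", ts=99.0);  b.write("song-1", ts=100.0)  # song-1 wins
assert a.value == b.value == "song-1"                     # converged, song-2 lost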

Data loss with a last writer wins conflict resolution policy is inevitable. There are mitigation strategies, such as timestamps on individual fields and conditional writes (which I'll discuss in Chapter 13), that can minimize the likelihood of data loss. However, the only way to safely use a database that employs a purely last writer wins policy is to ensure all writes store data objects with a unique key, and that objects are subsequently immutable. Any change to data in the database then requires the existing data object to be read and the new contents written to the database with a new key.

Version Vectors

To handle concurrent updates and not lose data, we need a way to identify and resolve conflicts. Figure 11-6 shows an approach to achieving this for a single replica using versioning. Each unique database object is stored along with a version number.

Figure 11-6. Conflict identification with versioning

Reading and writing data from the database proceeds as follows (a compare-and-set sketch follows the list):

  • When a client reads a database object, the object and its version are returned.

  • When a client updates a database object, it writes the new data values and the version of the object that was received from the previous read.

  • The database checks that the version in the write request is the same as the object’s version in the database, and if it is, it accepts the write and increments the version number.

  • If the version number accompanying a write does not match the database object version, a conflict has occurred, and the database must take remedial action to ensure data is not lost. It may return an error to the client and make it reread the new version. Alternatively, it may store both updates and inform the client that a conflict has occurred.
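
The following single-replica sketch shows this compare-and-set protocol; the ConflictError and store structure are illustrative, not a specific database's API:

class ConflictError(Exception):
    def __init__(self, current_version):
        self.current_version = current_version

class VersionedStore:
    def __init__(self):
        self.objects = {}  # key -> (value, version)

    def read(self, key):
        return self.objects.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self.objects.get(key, (None, 0))
        if expected_version != current:
            raise ConflictError(current)  # client rereads, merges, or retries
        self.objects[key] = (value, current + 1)
        return current + 1

store = VersionedStore()
store.write("playlist", ["song-1"], 0)
_, seen = store.read("playlist")
store.write("playlist", ["song-1", "song-2"], seen)      # succeeds, version 2
try:
    store.write("playlist", ["song-1", "song-3"], seen)  # stale version: conflict
except ConflictError as e:
    print("conflict: reread at version", e.current_version)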

图 11-6中,所描述的补救措施基于Riak 数据库。当 Riak 中发生冲突时,数据库会存储数据库对象的两个版本并将冲突返回给客户端。在此示例中,客户端的解决方案是简单地合并两个更新,这两个更新在 Riak 中称为同级更新

In Figure 11-6, the remedial action depicted is based on the Riak database. When a conflict occurs in Riak, the database stores both versions of the database object and returns the conflicts to the client. In this example, the resolution is for the client to simply merge the two updates, which are known as siblings in Riak.

With multiple replicas, however, the situation is somewhat more complex than in Figure 11-6. As writes may be handled by any replica, we need to maintain a version number for each unique object on each replica. Each replica maintains its own version as writes are processed, and also keeps track of the versions it has seen from the other replicas. This creates what is known as a version vector.

When a replica accepts a write from a client, it updates its own version number and sends the update request along with its version vector to the other replicas. The version vector is used by a replica to decide whether the update should be accepted or if siblings should be created.

The management of version vectors is the responsibility of the database. Database clients just need to present the latest version with updates and be able to handle conflicts when they occur. The following sidebar gives a brief overview of some of the theory behind the conflict resolution approach represented by version vectors.

As Figure 11-6 illustrates, when a database detects a conflict, the client typically needs to do some work to resolve it. If the database throws an error, the client can reread the latest version of the database object and attempt the update again. If siblings are returned, the client must perform some form of merge. These situations are very use case–specific and hence impossible to generalize.

Luckily, there are circumstances when a database can automatically resolve conflicts. Some databases, including Redis, Cosmos DB, and Riak, are leveraging recent results from the research community to support a collection of data types known as conflict-free replicated data types (CRDTs). CRDTs have semantics such that they can be concurrently updated and any conflicts can be resolved sensibly by the database. The value of a CRDT will always converge to a final state that is consistent on all replicas.

A simple example of a CRDT is a counter that could be used to maintain the number of followers for a user on a social media site. Increments and decrements to a counter can be applied in any order on different replicas, and the resulting value should eventually converge on all replicas.
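
The sketch below shows a grow-only counter (G-counter), one of the simplest CRDTs; a counter that also supports decrements (a PN-counter) simply pairs two of these. It is illustrative only, under the assumption of a fixed, known set of replicas:

class GCounter:
    # One slot per replica; merge takes the elementwise maximum
    def __init__(self, replica_id, n_replicas):
        self.id, self.slots = replica_id, [0] * n_replicas

    def increment(self):
        self.slots[self.id] += 1  # each replica only touches its own slot

    def value(self):
        return sum(self.slots)

    def merge(self, other):
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

# Increments applied in any order on different replicas converge after merging
r0, r1 = GCounter(0, 2), GCounter(1, 2)
r0.increment(); r1.increment(); r1.increment()
r0.merge(r1); r1.merge(r0)
assert r0.value() == r1.value() == 3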

Common CRDTs include sets, hash tables, lists, and logs. These data structures behave identically to their nondistributed counterparts, with minor caveats.9 Importantly, they relieve the application of the burden of conflict handling. This simplifies application logic, saving you time and money, and will probably make your applications less error prone.

Summary and Further Reading

Eventually consistent databases have become an established part of the landscape of scalable distributed systems. Simple, evolvable data models that are naturally partitioned and replicated for scalability and availability provide an excellent solution for many internet-scale systems.

Eventual consistency inevitably creates opportunities for systems to deliver stale reads. As a consequence, most databases provide tunable consistency. This allows the system designer to balance latencies for reads and writes, and trade off availability and consistency to meet application needs.

Concurrent writes to different replicas of the same database object can cause conflicts. These cause the database to have inconsistent replicas and to silently lose updates, neither of which are desirable in most systems. To address this problem, conflict resolution mechanisms are required. These often need application logic (and/or users) to resolve conflicts and ensure updates are not lost. This can cause additional application complexity. New research is coming to the rescue, however, and some databases support data types that have semantics to automatically resolve conflicts.

The classic reference for eventually consistent databases and the inspiration for many of today’s implementations is the original Dynamo paper. It is still a great read over a decade after its 2007 publication.

If you want to know more about how eventually consistent databases—and massive data stores in general—are built, I can think of no better source than Designing Data-Intensive Applications by Martin Kleppmann (O’Reilly, 2017). I also enjoy NoSQL for Mere Mortals by Dan Sullivan (Addison-Wesley Professional, 2015) for solid basic information. And if you want to know more about CRDTs, this review paper is a great place to start.14

1 The quantum communications systems of the future may well overcome distance-induced latencies.

2 For an excellent description of eventual consistency models, see https://oreil.ly/2qUQh.

3 MongoDB enables you to tune which replica handles reads.

4 Immediate consistency should not be confused with strong consistency, as described in the next chapter. Databases that favor availability over consistency and use W = N for writes still have an inconsistency window. Also, writes may not be completed on every replica if one or more is unreachable. To understand why, see discussion of hinted handoffs later in this chapter.

5 This formula uses integer division, such that the fractional part is discarded. For example, 5 / 2 = 2.

6 Giuseppe DeCandia et al., “Dynamo: Amazon’s Highly Available Key-Value Store,” in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.

7 Merkle trees are useful in many use cases, including Bitcoin/blockchain transaction verifications.

8 Leslie Lamport, “Time, Clocks, and the Ordering of Events in a Distributed System,” Communications of the ACM 21, no. 7 (1978): 558–65. https://doi.org/10.1145/359545.359563.

9 For example, if one process removes an element from a set concurrently with another node adding the same elements, the “add wins.”

10 For more details on how Netflix’s cloud data engineering team uses Cassandra to manage data at petabyte scale, check out this excellent presentation.

11 This Netflix technical blog entry explains how the Cassandra-based solution for scaling the user viewing history database evolved over time.

12 Cassandra benchmarks are indicative of the capabilities of eventually consistent data stores.

13 bet365 relies so heavily on Riak KV, it actually bought the technology! An overview of its business challenges is described here.

14 Marc Shapiro et al., “Convergent and Commutative Replicated Data Types,” Bulletin of the European Association for Theoretical Computer Science 104 (2011): 67–88. ⟨hal-00932833⟩.

Chapter 12. Strong Consistency

As I described in Chapter 11, eventually consistent databases are designed to scale by allowing data sets to be partitioned and replicated across multiple machines. Scalability is achieved at the expense of strong data consistency across replicas, and by allowing conflicting writes.

The consequences of these trade-offs are twofold. First, after a data object has been updated, different clients may see either the old or new value for the object until all replicas converge on the latest value. Second, when multiple clients update an object concurrently, the application is responsible for ensuring data is not lost and the final object state reflects the intent of the concurrent update operations. Depending on your system’s requirements, handling inconsistency and conflicts can be straightforward, or add considerable complexity to application code.

Another class of distributed databases provides an alternative model, namely strongly consistent data systems. Also known as NewSQL or, more recently, distributed SQL, strongly consistent systems attempt to ensure all clients see the same, consistent value of a data object once it has been updated. They also deliver the well-known benefits of atomicity, consistency, isolation, durability (ACID) database transactions to handle conflicting updates.

Transactions and data consistency, the characteristics everyone is familiar with in existing single-node relational databases, eliminate many of the complexities inherent in eventually consistent systems. Together they can significantly simplify application logic. As stated in Google’s original Spanner distributed database paper: “We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”1

For internet-scale systems, the trick of course is to provide the benefits of strongly consistent databases, along with the performance and availability that eventually consistent systems can achieve. This is the challenge that distributed SQL databases are tackling. In this chapter, I’ll explain the characteristics of these strongly consistent systems and the algorithms required to make it possible for consistent data systems to be partitioned and replicated for scalability and availability.

Introduction to Strong Consistency

In sequential programs, once you write a value (x) to a variable, you expect all subsequent reads will return (x). If this guarantee didn’t hold, as it doesn’t for concurrent programs without careful thread synchronization, writing software systems would be a lot more fraught.

This, however, is the case when you use an eventually consistent database system. A client may think it has written a new value to a data object, but other clients may access the same object and receive a stale value until the inconsistency window closes and all replica values have converged. In fact, as I described in Chapter 11, a client may even access an object it successfully updated and receive a stale value unless RYOWs consistency is supported.

In systems based on eventually consistent databases, applications must be aware of the precise consistency guarantees of the underlying data store, and be designed to deal with these accordingly. Handling inconsistent reads and concurrent write conflicts can add considerable complexity to code bases and test cases. If you do not take appropriate care, difficult-to-reproduce errors can creep into applications. Following Murphy’s law, these will inevitably only become apparent when the system experiences high load or unexpected failures.

In contrast, strongly consistent databases aim to deliver the same consistency guarantees as single-node systems. With strong consistency, you can write applications with assurances that once an update has been confirmed by the database, all subsequent reads by all clients will see the new value. And if concurrent clients attempt to update the same object, the updates behave as if one happens before the other. They do not occur concurrently and cause data loss or corruption.

Slightly confusingly, the technical community uses strong consistency to describe two subtly different concepts in distributed databases. These are:

Transactional consistency
This is the “C” in ACID transactions (see “ACID Transactions”) as supported by relational databases. In a distributed database that supports ACID transactions, you need an algorithm that makes it possible to maintain consistency when data objects from different physical data partitions and nodes are updated within a single transaction. Consistency in this case is defined by the semantics of the business logic executed within the transaction.
Replica consistency
Strong replica consistency implies that clients all see the same value for a data object after it has been updated, regardless of which replica they access. Basically, this eliminates the inconsistency window I covered in Chapter 11 in eventually consistent systems. There are various subtleties inherent in supporting strong replica consistency that I will explore later in this chapter.

The algorithms used for transactional and replica consistency are known as consensus algorithms. These algorithms enable nodes in a distributed system to reach consensus, or agreement, on the value of some shared state. For transactional consistency, all participants in the transaction must agree to commit or abort the changes executed within the transaction. For replica consistency, all replicas need to agree on the same order of updates for replicated data objects.

Solutions for transactional and replica consistency were developed by different technical communities at different times. For transactional consistency, the two-phase commit algorithm originated from work by Jim Gray, one of the pioneers of database systems, in 1978.2 The classic replica consistency algorithm, Paxos, was first described in 1998 by Leslie Lamport.3 I’ll spend the rest of this chapter exploring transaction and replica consistency and how these algorithms are used in distributed SQL databases.

Consistency Models

The database and distributed systems communities have studied consistency for more than four decades. Each has developed several different consistency models that have subtly different semantics and guarantees. This has led to a somewhat confusing and complex landscape of definitions and overloaded terminology. If you are interested in the full details, there is an excellent depiction of the different models and their relationships organized as a hierarchy on the Jepsen website. I’ll just focus on the strongest consistency model in this chapter. This is known variously as strict consistency, strict serializability, or external consistency, and implies the combination of the two most restrictive consistency models defined by the database and distributed systems communities. These are serializability and linearizability, respectively, as explained in the following:

Serializability
This is commonly referred to as transactional consistency, the “C” in ACID. Transactions perform one or more reads and writes on multiple data objects. Serializability guarantees that the execution of a set of concurrent transactions over multiple items is equivalent to some sequential execution order of the transactions.
Linearizability
This is concerned with reads and writes to single data objects. Basically, it says that all clients should always see the most recent value of a data object. Once a write to a data object succeeds, all subsequent reads that occur after the write must return the value of that write, until the object is modified again. Linearizability defines the order of operations using wall clock time, such that an operation with a more recent wall clock time occurs after any operations with lower wall clock times. In distributed databases with multiple data object replicas, linearizable consistency is concerned with replica consistency, essentially the “C” in the CAP theorem.

Combining these two models gives the strongest possible data consistency. The basic effect is that transactions execute in a serial order (serializability), and that order is defined by the wall clock times of the transactions (linearizability). For simplicity, I’ll refer to this as strong consistency.

Anyway, that’s a summary of the theory. To support these consistency models in distributed SQL databases, we require consensus algorithms, as I explain in the rest of this chapter.

Distributed Transactions

From an application developer’s perspective, the simplest way to think of transactions is as a tool to simplify failure scenarios in distributed systems. The application simply defines which operations must be carried out with ACID properties, and the database does the rest. This greatly reduces the application complexity, as you can ignore the subtle and numerous failure possibilities. Your code simply waits for the database to inform it of the transaction outcome (commit or abort) and behaves accordingly.

Example 12-1 shows a simple example of a purchasing transaction using the SQL variant of YugabyteDB.4 The transaction modifies the stock table to reflect the number of items ordered by the customer, and inserts a new row in the purchases table to represent the customer’s order. These operations are defined with a transaction boundary, marked by the BEGIN/END TRANSACTION syntax.

Example 12-1. YugabyteDB transaction example
BEGIN TRANSACTION;
    UPDATE stock SET in_stock = in_stock - purchase_amount
           WHERE stock_id = purchase_stock_id;
    INSERT INTO purchases (cust_id, stock_id, amount)
           VALUES (customer, purchase_stock_id, purchase_amount);
END TRANSACTION;

Transactional semantics ensure that both operations either succeed or fail. If a database does not support transactions, as in most NoSQL databases, the application programmer would effectively have to break the transaction down into two individual updates and define potentially complex exception handling. Basically, this would mean:

  • Performing each update separately, and checking that each succeeds.

  • If the INSERT fails after the UPDATE succeeds, the stock table update must be undone using another SQL statement. This is known as a compensating action (see the sketch after this list).

  • If the compensating action fails, or the service executing the code fails, you need to take remedial actions. This is where things start to get really complicated!
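
To illustrate how much work this pushes onto the application, here is a rough Java/JDBC sketch of the purchase from Example 12-1 written without transactions. It assumes the schema from Example 12-1, and its error handling only scratches the surface of the failure cases listed above.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: manual updates plus a compensating action, no transactions.
public class ManualPurchase {
    public void purchase(Connection db, int stockId, int custId, int amount)
            throws SQLException {
        // Step 1: decrement the stock level.
        try (PreparedStatement update = db.prepareStatement(
                "UPDATE stock SET in_stock = in_stock - ? WHERE stock_id = ?")) {
            update.setInt(1, amount);
            update.setInt(2, stockId);
            update.executeUpdate();
        }
        // Step 2: record the purchase.
        try (PreparedStatement insert = db.prepareStatement(
                "INSERT INTO purchases (cust_id, stock_id, amount) VALUES (?, ?, ?)")) {
            insert.setInt(1, custId);
            insert.setInt(2, stockId);
            insert.setInt(3, amount);
            insert.executeUpdate();
        } catch (SQLException insertFailed) {
            // Compensating action: undo the stock update. If this also
            // fails, or this service crashes first, further remediation
            // (e.g., a reconciliation job) is needed.
            try (PreparedStatement undo = db.prepareStatement(
                    "UPDATE stock SET in_stock = in_stock + ? WHERE stock_id = ?")) {
                undo.setInt(1, amount);
                undo.setInt(2, stockId);
                undo.executeUpdate();
            }
            throw insertFailed;
        }
    }
}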

In a single node database, committing a transaction is relatively straightforward. The database engine ensures transaction modifications and state are persisted to disk in a transaction log file. Should the database engine fail, the transaction log can be utilized on restart to restore the database to a consistent state. However, if the purchases and stock tables from Example 12-1 reside in different databases or different partitions in a distributed database, the process is somewhat more complex. You need an algorithm to ensure that both nodes agree on the transaction outcome.

Two-Phase Commit

Two-phase commit (2PC) is the classic distributed transaction consensus algorithm. It is widely implemented in established relational databases like SQL Server and Oracle, as well as contemporary distributed SQL platforms including VoltDB and Cloud Spanner. 2PC is also supported by external middleware platforms such as the Java Enterprise Edition, which includes the Java Transaction API (JTA) and Java Transaction Service (JTS). These external coordinators can drive distributed transactions across heterogeneous databases using the XA protocol.5

Figure 12-1 illustrates an example of the basic 2PC protocol based on Example 12-1. The protocol is driven by a coordinator, or leader. The coordinator can be an external service, for example the JTS, or an internal database service. In a distributed SQL database, the coordinator can be one of the partitions that is being updated as part of a multipartition transactional update.

When a database client starts a transaction (e.g., the BEGIN TRANSACTION statement in Example 12-1), a coordinator is selected. The coordinator allocates a globally unique transaction identifier (tid) and returns this to the client. The tid identifies a data structure maintained by the coordinator known as the transaction context. The transaction context records the database partitions, or participants, that take part in the transaction and the state of their communications. The context is persisted by the coordinator, so that it durably maintains the state of the transaction.

The client then executes the operations defined by the transaction, passing the tid to each participant that performs the database operations. Each participant acquires locks on mutated objects and executes the operations locally. It also durably associates the tid with the updates in a local transaction log. These database updates are not completed at this stage—this only occurs if the transaction commits.

Figure 12-1. Two-phase commit

Once all the operations in the transaction are completed successfully, the client tries to commit the transaction. This is when the 2PC algorithm commences on the coordinator, which drives two rounds of votes with the participants, as described next and sketched in code below:

Prepare phase
The coordinator sends a message to all participants to tell them to prepare to commit the transaction. When a participant successfully prepares, it guarantees that it can commit the transaction and make it durable. After this, it can no longer unilaterally decide to abort the transaction. If a participant cannot prepare, that is, if it cannot guarantee to commit the transaction, it must abort. Each participant then informs the coordinator about its decision to commit or abort by returning a message that contains its decision.
Resolve phase
When all the participants have replied to the prepare phase, the coordinator examines the results. If all the participants can commit, the whole transaction can commit, and the coordinator sends a commit message to each participant. If any participant has decided that it must abort the transaction, or doesn’t reply to the coordinator within a specified time period, the coordinator sends an abort message to each participant.
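
The coordinator's side of these two phases can be sketched in a few lines of Java. Participant is a hypothetical interface standing in for the messaging to each database partition; a real implementation must also durably log the decision before the resolve phase and handle timeouts, as discussed in the next section.

import java.util.List;

// Sketch of the coordinator's role in 2PC.
public class TwoPhaseCommitCoordinator {

    public interface Participant {
        boolean prepare(long tid);   // true = vote commit, false = vote abort
        void commit(long tid);
        void abort(long tid);
    }

    public boolean execute(long tid, List<Participant> participants) {
        // Phase 1: prepare. A negative vote (or, in a real system, a
        // timeout) means the whole transaction must abort.
        boolean allPrepared = true;
        for (Participant p : participants) {
            if (!p.prepare(tid)) {
                allPrepared = false;
                break;
            }
        }
        // (A real coordinator durably logs the decision here.)
        // Phase 2: resolve. Tell every participant the outcome.
        for (Participant p : participants) {
            if (allPrepared) p.commit(tid);
            else p.abort(tid);
        }
        return allPrepared;
    }
}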

2PC Failure Modes

2PC has two main failure modes. These are participant failure and coordinator failure. As usual, failures can be caused by systems crashing, or being partitioned from the rest of the application. From the perspective of 2PC, the crashes and partitions are indistinguishable:

Participant failure
When a participant crashes before the prepare phase completes, the transaction is aborted by the coordinator. This is a straightforward failure scenario. It’s also possible for a participant to reply to the prepare message and then fail. In either case, when the participant restarts, it needs to communicate with the coordinator to discover transaction outcomes. The coordinator can use its transaction log to look up the outcomes and inform the recovered participant accordingly. The participant then completes the transaction locally. Essentially then, participant failure doesn’t threaten consistency, as the correct transaction outcome is reached.
Coordinator failure
Should the coordinator fail after sending the prepare message, participants have a dilemma. Participants that have voted to commit must block until the coordinator informs them of the transaction outcome. If the coordinator crashes before or during sending out the commit messages, participants cannot proceed, as the coordinator has failed and will not send the transaction outcome until it recovers. This is illustrated in Figure 12-2, where the coordinator crashes after receiving the participant responses from the prepare phase.
Figure 12-2. Coordinator failure leads to an uncertain transaction outcome and blocked participants

There is no simple resolution to this problem. A participant cannot autonomously decide to commit as it does not know how other participants voted. If one participant has voted to roll back, and others to commit, this would violate transaction semantics. The only practical resolution is for participants to wait until the coordinator recovers and examines its transaction log.6 The log enables the coordinator to resolve all incomplete transactions. If it has logged a commit entry for an incomplete transaction, it will inform the participants to commit. Otherwise, it will roll back the transaction.

Transaction coordinator recovery and the transaction log make it possible to finalize incomplete transactions and ensure the system maintains consistency. The downside is that participants must block while the coordinator recovers. How long this takes is implementation dependent, but is likely to be at least a few seconds. This negatively impacts availability.

In addition, during this time, participants must hold locks on the data objects mutated by the transaction. The locks are necessary to ensure transaction isolation. If other concurrent transactions try to access these locked data items, they will be blocked. This results in increased response times and may cause requests to time out. In heavily loaded systems or during request spikes, this can cause cascading failures, circuit breakers to open, and other generally undesirable outcomes depending on the characteristics of the system design.

In summary, the weakness of 2PC is that it is not tolerant of coordinator failure. One possible way to fix this, as with all single point of failure problems, is to replicate the coordinator and transaction state across participants. If the coordinator fails, a participant can be promoted to coordinator and complete the transaction. Taking this path leads to a solution that requires a distributed consensus algorithm, as I describe in the next section.

Distributed Consensus Algorithms

Implementing replica consistency such that all clients see a consistent view of a data object’s replica values requires consensus, or agreement, on every replica value. All updates to replicas for an object must be applied in the same order at every replica. Making this possible requires a distributed consensus algorithm.

Much intellectual effort has been devoted to distributed consensus algorithms in the last 40 years or so. While consensus is simple conceptually, it turns out many subtle problems arise because messages between participants can be lost or delayed, and participants can crash at inconvenient times.

As an example of the need for consensus, imagine what could happen at the end of an online auction when multiple last second bids are submitted. This is equivalent to multiple clients sending update requests that can be handled by different replicas of the same auction data object. In an eventually consistent system, this could lead to replicas with different bid values and potentially the loss of the highest bid.

A consensus algorithm makes sure such problems cannot occur. More specifically:

  • All replicas must agree on the same winning bid. This is a correctness (or safety) property. Safety properties ensure nothing bad happens. In this case, two winning bids would be bad.

  • A single winning bid is eventually selected. This is a liveness property. Liveness ensures something good happens and the system makes progress. In this case consensus is eventually reached on a single winning bid. Consensus algorithms that guarantee liveness are known as fault-tolerant consensus algorithms.

  • The winning bid is one of the bids that was submitted. This ensures the algorithm can’t simply be hardcoded to agree on a predetermined value.

The basis of fault-tolerant consensus approaches are a class of algorithms called atomic broadcast, total order broadcast, or replicated state machines.7 These guarantee that a set of values, or states, are delivered to multiple nodes exactly once, and in the same order. 2PC is also a consensus algorithm. However, as I explained earlier in this chapter, it is not fault tolerant as it cannot make progress when the transaction coordinator, or leader, fails.

A number of well-known consensus algorithms exist. For example, Raft is a leader-based atomic broadcast algorithm.8 A single leader receives client requests, establishes their order, and performs an atomic broadcast to the followers to ensure a consistent order of updates.

In contrast, Leslie Lamport’s Paxos, probably the best-known consensus algorithm, is leaderless. This, along with other complexities, makes it notoriously tricky to implement.9 As a consequence, a variant known as Multi-Paxos10 was developed. Multi-Paxos has much in common with leader-based approaches like Raft and is the basis of implementations in distributed relational databases like Google Cloud Spanner.

To be fault tolerant, a consensus algorithm must make progress in the event of both leader and follower failures. When a leader fails, a single new leader must be elected and all followers must agree on the same leader. New leader election approaches vary across algorithms, but at their core they require:

  • Detection of the failed leader

  • One or more followers to nominate themselves as leaders

  • Voting, with potentially multiple rounds, to select a new leader

  • A recovery protocol to ensure all replicas attain a consistent state after a new leader is elected

Of course, followers may also be unavailable. Fault-tolerant consensus algorithms are therefore designed to operate with just a quorum, or majority, of participants. Quorums are used both for acknowledging atomic broadcasts and for leader election. As long as a quorum of the participating nodes are available and agree, the algorithm can make progress. I’ll explore these issues in more detail in the following subsections, which use the Raft algorithm as an example.

Raft

Raft was designed as a direct response to the complexity inherent in the Paxos algorithm. Termed “an understandable consensus algorithm,” it was first published in 2013.11 Importantly, a reference implementation was also published. This provides a concrete description of the concepts in Raft, and acts as a basis for implementers to leverage in their own systems.

Raft is a leader-based algorithm. The leader accepts all updates and defines an order for their execution. It then takes responsibility for sending these updates to all replicas in the defined order, such that all replicas maintain identical committed states. The updates are maintained as a log, and Raft essentially replicates this log to all members of the system.

A Raft cluster has an odd number of nodes, for example, three or five. This enables consensus to proceed based on quorums. At any instant, each node is either a leader, a follower, or a candidate for leader if a leader failure has been detected. The leader sends periodic heartbeat messages to followers to signal that it is still alive. The message flow in a basic Raft cluster architecture is shown in Figure 12-3. The time period for leader heartbeats is typically around 300–500 milliseconds.

Each leader is associated with a monotonically increasing value known as a term. The term is a logical clock, and each valid term value is associated with a single leader. The current term value is persisted locally by every node in the cluster, and is essential for leader election, as I’ll soon explain. Each heartbeat message contains the current term value and leader identity and is delivered using an AppendEntries() message. AppendEntries() is also utilized to deliver new entries to commit on the log. During idle periods when the leader has no new requests from clients, an empty AppendEntries() simply suffices as the heartbeat.

Figure 12-3. Message exchanges in a Raft cluster with one leader and two followers

During normal operations, all client updates are sent to the leader. The leader orders the updates and appends them to a local log. Initially, all log entries are marked as uncommitted. The leader then sends the updates to all followers using an AppendEntries message, which also identifies the term and the position of the updates in the log. When a follower receives this message, it persists the update to its local log as uncommitted and sends an acknowledgment to the leader. Once the leader has received positive acknowledgments from a majority of followers, it marks the update as committed and communicates the decision to all followers.

This protocol is depicted in Figure 12-4. Log entries 1 and 2 are committed on all three replicas, and the corresponding mutations are applied to the database partitions to become visible to clients. Log entry 3 is only committed on the leader and one follower. Follower 1 will eventually commit this update.

Clients have also sent updates to the leader, represented by log entries 4 and 5. The leader writes these to its local log and marks them as uncommitted. It will then send AppendEntries() messages to the followers and, if no exceptions occur, the followers will acknowledge these updates and they will be committed at all the replicas.

Figure 12-4. Log replication with Raft

Only a majority of followers are required to commit an entry on the log. This means the committed log entries may not be identical at every follower at any instant. If a follower falls behind or is partitioned, and is not acknowledging AppendEntries requests, the leader continues to resend messages until the follower responds. Duplicated messages to followers can be recognized using the term and sequence numbers in the messages and safely discarded.
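
The leader's commit rule can be sketched as follows. The field names are illustrative, and the logic is simplified; full Raft, for example, only directly commits entries replicated during the leader's current term.

// Sketch of the leader-side commit rule in Raft-style log replication:
// an entry is committed once a majority of the cluster (leader included)
// has appended it.
public class RaftLeader {
    private final int clusterSize;   // odd, e.g., three or five nodes
    private final int[] matchIndex;  // highest log index acked per follower
    private long commitIndex = 0;

    public RaftLeader(int clusterSize) {
        this.clusterSize = clusterSize;
        this.matchIndex = new int[clusterSize - 1];
    }

    // Called when a follower acknowledges an AppendEntries message.
    public void onAppendEntriesAck(int follower, int ackedIndex) {
        matchIndex[follower] = Math.max(matchIndex[follower], ackedIndex);
        advanceCommitIndex();
    }

    private void advanceCommitIndex() {
        long candidate = commitIndex + 1;
        while (true) {
            int replicas = 1; // the leader itself holds the entry
            for (int idx : matchIndex) {
                if (idx >= candidate) replicas++;
            }
            if (replicas > clusterSize / 2) {
                commitIndex = candidate; // majority reached: committed
                candidate++;
            } else {
                break;
            }
        }
    }
}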

Leader Election

The leader in Raft sends periodic heartbeat messages to followers. Each follower maintains an election timer, which it starts after receiving a heartbeat message. If the timer expires before another heartbeat is received, the follower starts an election. Election timers are randomized to minimize the likelihood that multiple followers time out simultaneously and call an election.
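
The randomized timer itself is simple to sketch. The bounds below are arbitrary example values, not Raft's or any implementation's defaults; the point is only that randomization makes simultaneous timeouts unlikely.

import java.util.concurrent.ThreadLocalRandom;

// Illustrative randomized election timer for a follower.
public class ElectionTimer {
    private static final long MIN_TIMEOUT_MS = 300; // example bounds only
    private static final long MAX_TIMEOUT_MS = 600;
    private long deadline;

    // Called whenever a heartbeat (AppendEntries) arrives.
    public void reset() {
        long timeout = ThreadLocalRandom.current()
                .nextLong(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);
        deadline = System.currentTimeMillis() + timeout;
    }

    // Polled by the follower; true means start an election.
    public boolean expired() {
        return System.currentTimeMillis() >= deadline;
    }
}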

If a follower’s election timeout expires, it changes its state to candidate, increments the election term value, and sends a RequestVote message to all nodes. It also votes for itself. The RequestVote message contains the candidate’s identifier, the new term value, and information about the state of the committed entries in the candidate’s log. The candidate then waits until it receives replies. If it receives a majority of positive votes, it will transition to leader, and start sending out heartbeats to inform the other nodes in the cluster about its newly acquired status. If a majority of votes are not received, it remains a candidate and resets its election timer.

When followers receive a RequestVote message, they perform one of the following actions (sketched in code after this list):

  • If the term in the incoming message is greater than the locally persisted term, and the candidate’s log is at least as up to date as the follower’s, it votes for the candidate.

  • If the term is less than or equal to the local term, or the follower’s log has committed log entries that are not present in the candidate’s log, it denies the leadership request.
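
The following sketch expresses these two rules as the vote decision a follower makes. Names are illustrative, and the logic is a simplification of the full algorithm.

// Sketch of a follower's RequestVote decision.
public class VoteDecision {
    private long currentTerm;    // persisted locally by every node
    private long lastLogTerm;    // term of this follower's last log entry
    private long lastLogIndex;   // index of this follower's last log entry

    public boolean grantVote(long candidateTerm,
                             long candidateLastLogTerm,
                             long candidateLastLogIndex) {
        // Deny if the candidate's term is not newer than ours.
        if (candidateTerm <= currentTerm) return false;
        // The candidate's log must be at least as up to date as ours.
        boolean upToDate =
                candidateLastLogTerm > lastLogTerm ||
                (candidateLastLogTerm == lastLogTerm &&
                 candidateLastLogIndex >= lastLogIndex);
        if (!upToDate) return false;
        // Granting the vote advances our term, which also enforces the
        // one-vote-per-term rule for this election.
        currentTerm = candidateTerm;
        return true;
    }
}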

For example, Follower 1 in Figure 12-4 could not become leader as its committed log entries are not up to date. Follower 2 does have all committed log entries and could become leader. To illustrate this, Figure 12-5 shows how Follower 2 can transition to leader when its election timer expires.

Figure 12-5. Leader election in Raft

These conditions on Raft’s leader election ensure that any elected leader has all the committed entries from previous terms in its log. If a candidate does not have all committed entries in its log, it cannot receive a positive vote from more up-to-date followers. The candidate will then back down, another election will be started, and eventually a candidate with the most up-to-date log entries will win.

It’s also possible for the election timers of two or more followers to expire simultaneously. When this happens, each follower will transition to a candidate, increment the term, and send RequestVote messages. Raft enforces a rule whereby any node can only vote once within a single term. Hence, when multiple candidates start an election:

  • One may receive a majority of votes and win an election.

  • None may receive a majority. In this case, candidates reset their election timers and another election will be initiated. Eventually a leader will be elected.

Raft has attracted considerable interest due to its relative simplicity. It is implemented in multiple production systems that require consensus. These include the Neo4j and YugabyteDB databases, the etcd key-value store, and Hazelcast, a distributed in-memory object store.

Strong Consistency in Practice

Distributed SQL databases have undergone a rapid evolution since around 2011, when the term NewSQL was first coined. The manner in which these databases support strong consistency varies quite considerably across this class of technologies, so it pays to dig into the often-murky details to understand the consistency guarantees provided. In the following two sections, I’ll briefly highlight the different approaches taken by two contemporary examples.

VoltDB

VoltDB is one of the original NewSQL databases. It is built upon a shared-nothing architecture, in which relational tables are sharded using a partition key and replicated across nodes. Low latencies are achieved by maintaining tables in memory and asynchronously writing snapshots of the data to disk. This limits the database size to the total memory available in the cluster of VoltDB nodes. The primary deployments of VoltDB are in the telecommunications industry.

Each VoltDB table partition is associated with a single CPU core. A core is responsible for executing all read and write requests at its associated partitions, and these are ordered sequentially by a Single Partition Initiator (SPI) process that runs on the core. This means each core executes database requests on its associated partitions in a strict single-threaded manner. Single-threaded execution alleviates contention concerns and the overheads of locking, and is an important mechanism that facilitates VoltDB’s ACID consistency support. The SPI for a partition also ensures write requests are executed in the same order for each partition replica.
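
The essence of this execution model is easy to sketch in Java: a single thread drains an ordered command log for its partition, so requests never interleave and no locks are needed. This is purely illustrative and not VoltDB's implementation.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of single-threaded, in-order execution for one partition.
public class PartitionExecutor implements Runnable {
    private final BlockingQueue<Runnable> commandLog = new LinkedBlockingQueue<>();

    // Requests are appended in the order they are received.
    public void submit(Runnable storedProcedure) {
        commandLog.add(storedProcedure);
    }

    @Override
    public void run() {
        try {
            while (true) {
                // Execute commands strictly one at a time, in order.
                commandLog.take().run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // shut down cleanly
        }
    }
}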

Clients submit requests as SQL stored procedures. A stored procedure is regarded as a transactional unit. When a client request arrives at VoltDB, the SQL query analyzer generates an execution plan based on the database schema and the partition keys and indexes available for the tables. Based on this execution plan, VoltDB sends requests to the partition or partitions that the query needs to access.

Importantly, VoltDB delivers queries to each partition replica for execution in exactly the same order. The SPI associated with a partition simply accepts requests into a local command log and executes them one at a time, as illustrated in Figure 12-6. The query analyzer determines which table a stored procedure wishes to access. It then dispatches the stored procedures to be executed serially by the CPU core that is associated with the table partitions necessary to execute the transaction.

Figure 12-6. VoltDB single-partition transaction execution architecture

This has important implications for write transactions, based on whether the transaction mutates data in one or multiple partitions. If a transaction only modifies data in a single partition, as in Figure 12-6, it can execute at each SPI and commit unimpeded at each replica. As VoltDB sends transactions to execute at each partition replica in exactly the same order, this guarantees serializability without the need for data object locking and 2PC. Simply, you don’t have isolation concerns in a single-threaded system. Hence, single partition transactions can execute with extremely low latency.

However, if the query planner determines a transaction mutates data in two or more partitions, VoltDB sends the request for coordination across multiple cores. A cluster-wide Multi-Partition Initiator (MPI) acts as the coordinator and drives a 2PC algorithm to ensure the transaction commits or aborts atomically at all partitions. This introduces higher overheads and hence lower performance for multipartition transactions.

As VoltDB is an in-memory database, it must take additional measures to provide data safety and durability. You can configure two mechanisms, periodic command logging and partition snapshots, to meet application performance and safety requirements as described in the following:

  • Each SPI writes the entries in its command log to persistent storage. If a node fails, VoltDB can restore the partition by reading the latest snapshot of the partition and sequentially executing the commands in the command log. Command log durability hence facilitates recoverability. The frequency with which the command log is persisted is controlled by a system-defined interval value. The shorter the interval (on the scale of a few milliseconds), the lower the risk of losing updates if a node should crash. There’s an inherent trade-off here between performance and safety.

  • Each partition also defines a snapshot interval. This defines how often the local partition’s data is written to disk. Typically, this is configured in the seconds-to-minutes range, depending on transaction load.

These two settings have an important interaction. When VoltDB successfully writes a partition to persistent storage, the command log can be truncated. This is because the outcomes of all the transactions in the command log are durable in the latest partition snapshot, and hence the commands can be discarded.

Finally, since version 6.4, VoltDB supports linearizability, and hence the strongest consistency level, within the same database cluster. VoltDB achieves linearizability because it reaches consensus on the order of writes at all partitions, and transactions do not interleave because they are executed sequentially. However, up until this version, stale reads were possible as read-only transactions were not strictly ordered with write transactions, and could be served by out-of-date replicas. The root cause of this issue was an optimization that tried to load balance reads across partitions. You can read all about the details of the tests that exposed these problems and the fixes at the Jepsen website.12

Google Cloud Spanner

In 2013, Google published the Spanner database paper.13 Spanner is designed as a strongly consistent, globally distributed SQL database. Google refers to this strong consistency as external consistency. Essentially, from the programmer’s perspective, Spanner behaves indistinguishably from a single machine database. Spanner is exposed to Google clients through the Cloud Spanner service. Cloud Spanner is a cloud-based database as a service (DBaaS) platform.

To scale out, Cloud Spanner partitions database tables into splits (shards). Splits contain a contiguous key range for a table, and one machine can host multiple splits. Splits are also replicated across multiple availability zones to provide fault tolerance. Cloud Spanner keeps replicas consistent using the Paxos consensus algorithm. Like Raft, Paxos enables a set of replicas to agree on the order of a sequence of updates. The Cloud Spanner Paxos implementation has long-lived elected leaders and commits replica updates upon a majority vote from the replica set.

Cloud Spanner hides the details of table partitioning from the programmer. It will dynamically repartition data across machines as data volumes grow or shrink and migrate data to new locations to balance load. An API layer processes user requests. This utilizes an optimized, fault-tolerant lookup service to find the machines that host the key ranges a query accesses.

Cloud Spanner supports ACID transactions. If a transaction only updates data in a single split, the Paxos leader for the split processes the request. It first acquires locks on the rows that are modified, and communicates the mutations to each replica. When a majority of replicas vote to commit, the leader responds to the client and, in parallel, tells the replicas to apply the changes to persistent storage.

Transactions that modify data in multiple splits are more complex, and incur more overhead. When the client attempts to commit the transaction, it selects the leader of one of the modified splits as the transaction coordinator to drive a 2PC algorithm. The other split leaders become participants in the transaction. This architecture is depicted in Figure 12-7. The Purchases table leader is selected as the 2PC coordinator, and it communicates with the leaders from the modified Stock West and Stock East table splits as 2PC participants. Cloud Spanner uses Paxos to ensure consensus on the order of replica updates within each replica group.

Figure 12-7. Cloud Spanner 2PC

The coordinator communicates the client request to each participant. As each participant is the Paxos leader for the split, it acquires locks for the rows modified on a majority of split replicas. When all participants confirm they have acquired the necessary locks, the coordinator chooses a commit timestamp and tells the participants to commit. The participants subsequently communicate the commit decision and timestamp to each of their replicas, and all replicas apply the updates to the database. Should a participant be unable to prepare to commit, the coordinator directs all participants to abort the transaction.

Importantly, the 2PC implementation behaves as a Paxos group. The coordinator replicates the state of the transaction to the participants using Paxos. Should the coordinator fail, one of the participants can take over as leader and complete the transaction. This eliminates the problem I described earlier in this chapter of coordinator failure leading to blocked transactions, at the cost of additional coordination using Paxos.

Cloud Spanner also supports linearizability of transactions. This basically means that if transaction T1 commits before transaction T2, then transaction T2 can only commit at a later time, enforcing real-time ordering. T2 can also observe the results of T1 after it commits.

Figure 12-8 demonstrates how this works in Spanner. Transaction T1 reads and modifies data object (x). It then successfully commits, and the commit occurs at time t1. Transaction T2 starts after T1 but before T1 commits. T2 reads and modifies data object (y), then reads and modifies (x), and finally commits at time t2. When T2 reads (x), it sees the effects of T1 on (x) as the read occurs after T1 commits.

Cloud Spanner uses the commit time for a transaction to timestamp all the objects modified within the transaction scope. This means all the effects of a transaction appear to have occurred at exactly the same instant in time. In addition, the order of the transactions is reflected in the commit timestamps, as t1 < t2.

Figure 12-8. Linearizability of transactions in Cloud Spanner

Achieving linearizability requires a reliable time source across all nodes.14 This is not possible using NTP-style time services, as clock skew across nodes can be of the order of a few hundred milliseconds. From Figure 12-8, transaction T2 may commit at an earlier time than transaction T1 if T2 is using a time source that is behind that of T1.

Cloud Spanner implements a unique solution to this problem, namely the TrueTime service. TrueTime equips Google data centers with satellite-connected GPS and atomic clocks, and provides closely synchronized clocks with a known upper bound on clock skew, reportedly around 7 milliseconds. All data objects in Spanner are associated with a TrueTime timestamp that represents the commit time of the transaction that last mutated the object.

As TrueTime still has an inherent, albeit small, clock skew, Cloud Spanner introduces a commit wait period. A commit timestamp is generated from TrueTime and the coordinator then waits for a period that is equal to the known upper bound clock skew. By introducing this wait period, all transaction locks are held and the data mutated by the transaction is not visible to other transactions until TrueTime is guaranteed to report a higher timestamp at all nodes. This ensures any concurrent transactions will be blocked on the locks and hence must use a higher commit timestamp, and all clients will always see commit timestamps that are in the past.
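
To make the mechanics concrete, here is a minimal Java sketch of commit wait. The TrueTime-style API returning the current time as an uncertainty interval is an assumption for illustration, not a real interface:

// Assumed TrueTime-style API: current time as an uncertainty interval
record TTInterval(long earliestMicros, long latestMicros) {}

interface TrueTime {
    TTInterval now();
}

class CommitWait {
    static long commitTimestamp(TrueTime tt) throws InterruptedException {
        // Choose the commit timestamp at the upper bound of the uncertainty interval
        long commitTs = tt.now().latestMicros();
        // Hold locks until the timestamp is guaranteed to be in the past at all nodes
        while (tt.now().earliestMicros() <= commitTs) {
            Thread.sleep(1);   // the skew bound is ~7 ms, so the wait is short
        }
        return commitTs;       // now safe to release locks and acknowledge the commit
    }
}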

There’s one more ingredient needed for strong consistency in Cloud Spanner. As updates are replicated by Paxos and committed when a majority of nodes agree, it is possible for a client read request to access a replica that has not received the latest update for a data object. By default, Cloud Spanner provides strongly consistent reads. When a replica receives a read, it communicates with the Paxos leader for its replica split and checks it has the most up-to-date value for all objects accessed by the read. Again, this mechanism introduces overheads to guarantee clients do not see stale data.

Cloud Spanner is an integral component of GCP. Its customer base spans industries such as financial services, retail, and gaming, all attracted by the strong consistency guarantees as well as high availability and globally distributed deployment capabilities. Interestingly, Cloud Spanner has inspired open source implementations based on the Spanner architecture, but which do not require custom TrueTime-style hardware. The trade-off, of course, is lower consistency guarantees.15 Notable examples are CockroachDB and YugabyteDB.

Summary and Further Reading

For many application areas, a scalable and highly available distributed database with the consistency guarantees and ease of programming of a single machine is the holy grail of data management systems. Building such a database turns out to be rather difficult. Additional coordination and consensus mechanisms need to be incorporated to provide the data consistency expected of a sequential system. These database platforms are complex to build correctly and even more complex to make highly available and provide low response times.

Consistency in general is a complex topic, with overloaded terminology generated separately by the database and distributed systems communities. In this chapter, I’ve focused on the two strongest consistency guarantees from each community, serializability and linearizability, and explained consensus algorithms that are fundamental to achieving these levels of consistency. Using VoltDB and Cloud Spanner as examples, I’ve shown how distributed databases at scale utilize these algorithms along with innovative design approaches to achieve strong consistency.

Distributed systems consistency remains a topic of active research and innovation. A unique approach for a strongly consistent database is embodied in the Calvin database system.16 Calvin preprocesses and sequences transactions so that they are executed by replicas in the same order. This is known as deterministic transaction execution. It essentially reduces the coordination overheads of transaction execution as every replica sees the same inputs and hence will produce the same outputs. Fauna is the most notable database implementation of the Calvin architecture.

If you really want to deep dive into the world of consistency, the Jepsen website is a wonderful resource. There are around 30 detailed analyses of adherence to promised consistency levels for multiple distributed databases. These analyses are often extremely revealing and expose areas where promises don’t always meet reality.

1 James C. Corbett et al., “Spanner: Google’s Globally Distributed Database.” ACM Transactions on Computer Systems (TOCS) 31.3 (2013), 1–22. https://oreil.ly/QYX8y.

2 Jim Gray, “Notes on Database Operating Systems.” In R. Bayer et al. Operating Systems: An Advanced Course. Vol. 60. Lecture Notes in Computer Science. Berlin: Springer, 1978.

3 Leslie Lamport, “The Part-Time Parliament.” ACM Transactions on Computer Systems 16, no. 2 (1998), 133–69. https://doi.org/10.1145/279227.279229.

4 YugabyteDB is a distributed relational database.

5 Support for XA is mixed across platforms, and it is rarely used in large-scale systems. If you want to learn more, yours truly wrote a book on it: Ian Gorton, Enterprise Transaction Processing Systems: Putting the Corba OTS, Encina++ and OrbixOTM to Work (Addison-Wesley, 2000).

6 It is possible to introduce another phase of voting to get around the problem of 2PC blocking when the coordinator fails. This is known as a three-phase commit. However, it adds even more overheads to those already inherent in 2PC, and is hence rarely used in practice.

7 The elements replicated are the commands which cause transitions in the replicated state machines to execute in the same order at each replica.

8 Diego Ongaro and John Ousterhout, “In Search of an Understandable Consensus Algorithm.” In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC’14), 305–320. USA: USENIX Association.

9 Tushar D. Chandra et al., “Paxos Made Live: An Engineering Perspective.” In Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing (PODC ’07), 398–407. New York, NY, USA: Association for Computing Machinery.

10 Robbert Van Renesse and Deniz Altinbuken, “Paxos Made Moderately Complex,” ACM Computing Surveys 47, no. 3 (2015), 1–36. https://doi.org/10.1145/2673577.

11 Diego Ongaro and John Ousterhout, “In Search of an Understandable Consensus Algorithm.” In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC’14), 305–320. USA: USENIX Association.

12 Kyle Kingsbury provides distributed database consistency testing using the Jepsen test suite. The results for testing VoltDB 6.3 are a fascinating read.

13 James C. Corbett et al., “Spanner: Google’s Globally Distributed Database.” ACM Transactions on Computer Systems (TOCS) 31.3 (2013), 1–22. https://dl.acm.org/doi/10.1145/2491245.

14 Or serialized execution of transactions based on a globally agreed order—see “VoltDB”.

15 This blog post by Spencer Kimball and Irfan Sharif is an excellent analysis of how distributed SQL databases can approach the highest consistency guarantees with NTP-based clocks.

16 Alexander Thomson et al., 2012. “Calvin: Fast Distributed Transactions for Partitioned Database Systems.” In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD ’12), 1–12. New York, NY, USA: Association for Computing Machinery.

Chapter 13. Distributed Database Implementations

In the previous three chapters, I’ve described the various distributed system principles and architectures that are widely employed in scalable distributed databases. These make it possible to partition and replicate data over multiple storage nodes, and support different consistency and availability models for replicated data objects.

Precisely how specific databases build on these principles is highly database dependent. Different database providers pick and choose among well-understood approaches, as well as designing their own proprietary mechanisms, to implement the software architecture quality attributes they wish to promote in their products. This means databases that are superficially similar in their architectures and features will likely behave very differently. Even implementations of the same feature—for example, primary election—can vary significantly in terms of their performance and robustness across databases.

Evaluating a database technology for a specific use case therefore requires both knowledge and diligence. You need to understand how the basic architecture and data model of a candidate technology match your requirements in terms of scalability, availability, consistency, and of course other qualities such as security that are beyond the scope of this book. To do this effectively, you need to delve under the hood and gain insights into precisely how high-priority features for your application work. I don’t think I’d surprise anyone by telling you about the dangers of faithfully believing marketing materials. With apologies to George Orwell, all databases are scalable, but some are more scalable than others.

In this chapter I’ll briefly review the salient features of three widely deployed distributed databases, namely Redis, MongoDB, and DynamoDB. Each of these implementations support different data models and make very different trade-offs on the consistency-versus-availability continuum. These design decisions percolate through to the performance and scalability each system offers.

The approach I take can work as a blueprint for carrying out your own database platform comparisons. You’ll see many of the concepts already discussed in this book raising their heads here again. You’ll also see product-specific approaches to solving some of the problems faced in distributed databases. As always, the devil lurks deeply in the details.

Redis

Since its initial release in 2009, Redis has grown in popularity to become one of the most widely deployed distributed databases. The main attraction of Redis is its ability to act as both a distributed cache and data store. Redis maintains an in-memory data store, known as a data structure store. Clients send commands to a Redis server to manipulate the data structures it holds.

Redis is implemented in C and uses a single-threaded event loop to process client requests. In version 6.0, this event loop was augmented with additional threads to handle network operations in order to provide more bandwidth for the event loop to process client requests. This enables a Redis server to better exploit multicore nodes and provide higher throughput.

To provide data safety, the in-memory data structure maintained by a single Redis server can be made durable using two approaches. In one, you can configure a periodic background thread to dump the memory contents to disk. This snapshot process uses the fork() system call, and hence can be expensive if the memory contents are large. In high-throughput systems, snapshots are typically configured at intervals of tens of seconds. Snapshots can also be triggered after a configurable number of writes to provide a known bound of potential data loss.

The other approach is to configure Redis to log every command to an append-only file (AOF). This is essentially an operation log, and is persisted by default every second. Using both approaches, namely snapshots and operation logging, provides the greatest data safety guarantees. In the event of a server crash, the AOF can be replayed against the latest snapshot to recreate the server data contents in memory.
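
As an illustration, the following redis.conf settings (the thresholds here are arbitrary) combine both mechanisms, periodic snapshots plus a once-per-second fsync of the AOF:

# Snapshot if at least 10,000 writes have occurred in the last 60 seconds
save 60 10000

# Enable the append-only file and fsync it once per second (the default)
appendonly yes
appendfsync everysec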

Data Model and API

Redis is a key-value store. It offers a small collection of data structures that applications can use to create data objects associated with unique keys. Each data structure has a set of defined commands that applications use to create, manipulate, and delete data objects. Commands are simple and operate on a single object identified by the key.

The core Redis structures are:

Strings
Strings are versatile in Redis and are able to store both text and binary data up to 512 MB in length. For example, you can use strings as a random access vector using get() and set() operations on specified subranges. Strings can also be used to represent and manipulate counters (see the sample commands after this list).
Linked lists
These are lists of strings, with operations to manipulate elements at the head, tail, and in the body of the list.
Sets and sorted sets
Sets represent a collection of unique strings. Sorted sets associate a score value with each element and maintain the strings in ascending score order. This makes it possible to efficiently access elements in the set by score or rank order.
Hashes
Like a Python map, a Redis hash maps a key value represented as a string to one or more string values. Hashes are the primary Redis structure for representing application data objects such as user profiles or stock inventory.
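
The sample commands below illustrate these structures; the keys and values are invented for illustration:

SET pageviews 100                # a string used as a counter
INCR pageviews                   # atomically increment it

LPUSH neworders "600066"         # push onto the head of a list
LRANGE neworders 0 -1            # read the whole list

ZADD vertfeet 30701 "skier:6788321471"   # sorted set scored by vertical feet
ZRANGE vertfeet 0 2 WITHSCORES           # lowest-scored entries, with scores

HSET user:89788 first "Ian" lastorder "600066"   # hash of field-value pairs
HGET user:89788 lastorder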

Operations on a single key are atomic. You can also specify a group of operations as requiring atomic execution using the multi and exec commands. The group of commands you place between multi and exec is known as a Redis transaction, and the commands are serialized and executed in order. The code example below defines a transaction with two operations. The first adds a string representing a new customer order to a neworders list. The second modifies the value of the key lastorder in the hashmap for the user. A Redis server queues these commands until it receives the exec command, and then executes them in sequence:

multi
lpush neworders “orderid 600066 customer 89788 item 788990 amount 11 date 12/24/21”
hmset user:89788 lastorder 600066
exec

Transactions are essentially the only way to perform operations that move or compute data across multiple types. They are limited, however, in that they only provide atomicity when all commands succeed. If a command fails, there are no rollback capabilities. This means that even if one command fails, the remaining commands in the transaction will still be executed. Similarly, if a crash occurs while the server is executing the transaction, the server is left in an unknown state. Using the AOF durability mechanism, you can fix the state administratively on restart. In reality, Redis transactions are somewhat of a misnomer; they certainly aren’t ACID.

Distribution and Replication

In its original version, Redis was a single server data store, which somewhat limited its scalability. In 2015, Redis Cluster was released to facilitate partitioning and replication of a Redis data store across multiple nodes. Redis Cluster defines 16,384 hash slots for a cluster. Every key is hashed modulo 16,384 to a specific slot, which is configured to reside on a host in the cluster. This is illustrated in Figure 13-1, in which four nodes with unique identifiers comprise the cluster and an equal range of hash slots is assigned to each.

Figure 13-1. Sharding in Redis using hash slots

Each node in the cluster runs a Redis server and an additional component that handles internode communications in the cluster. Redis uses a protocol known as the Cluster bus to enable direct TCP communications between every node in the cluster. Nodes maintain state information about all other nodes in the cluster, including the hash slots that each node serves. Redis implements this capability using a gossip protocol that efficiently enables nodes to track the state of all the nodes in the cluster.

Clients can connect to any node in the cluster and submit commands to manipulate specified keys. If a command arrives at a node that does not manage the hash slot for a given object, it looks up the address of the server that hosts the required hash slot. It then responds to the client with a MOVED error and the address of the node where the keys in the hash slot reside. The client must then resend the command to the correct node. Typically, Redis client drivers will maintain an internal directory that maps hash slots to server nodes so that redirections do not occur when the cluster is stable.
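
For example, a redis-cli session against a node that does not own the key's hash slot might look like the following; the slot number and address are invented:

127.0.0.1:7000> GET user:89788
(error) MOVED 13531 127.0.0.1:7002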

Another implication of this architecture is that commands in transactions must access keys that reside in the same hash slot. Redis cannot execute commands on objects that reside in different hash slots on different nodes, and careful data modeling is needed to work around this limitation. Redis does provide a workaround through a concept known as hash tags, which force keys into the same hash slot by hashing only a designated substring of the key that is identical across the related objects.
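
With hash tags, only the substring between { and } is hashed, so keys that share that substring are guaranteed to land in the same slot. A sketch of the earlier order transaction rewritten this way:

multi
lpush {user:89788}:neworders "orderid 600066 customer 89788 item 788990 amount 11 date 12/24/21"
hmset {user:89788}:profile lastorder 600066
exec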

You can resize a Redis Cluster to add new nodes or remove nodes from the cluster. When this occurs, hash slots must be assigned to the new nodes or moved from the deleted nodes to existing nodes. You perform this action using the CLUSTER administrative command that modifies a node’s cluster configuration information. Once hash slots are reassigned to a different node, Redis migrates the objects in the migrated hash slots automatically. Objects are serialized and sent from their existing home node to the new home node. When an object is successfully acknowledged, it is removed from the original home node and becomes visible to clients at its new location.

You can also replicate every node in a cluster using a primary-replica architecture. The primary updates replicas asynchronously to provide data safety. To scale out read workloads, you can configure replicas to handle read commands. By default, the primary does not wait until replicas acknowledge an update before returning success to the client.

Optionally, the client can issue a WAIT command after an update. This specifies the number of replicas that should acknowledge the update and a timeout period after which the WAIT should return. A timeout period of zero specifies that the client should block indefinitely. In the following example, the client blocks until two replicas have acknowledged updates, or a 500 milliseconds timeout expires. In either case, Redis returns the number of replicas that have been updated:

WAIT 2 500

In the event of a primary failure, a replica is promoted to primary. Redis uses a custom primary election algorithm. A replica that detects its primary has failed starts an election and attempts to obtain a vote from a majority of primary nodes in the cluster. If it obtains a majority, it promotes itself to primary and informs the nodes in the cluster. The election algorithm enables replicas to exchange information to try and determine which replica is most up to date. There is no guarantee, however, that the most up-to-date replica will eventually be promoted to primary. Hence some data loss is possible if an out-of-date replica becomes primary.

Strengths and Weaknesses

One way to think about Redis, and in fact most in-memory databases, is that it is essentially a disk-backed cache with trailing persistence. This architecture has an inherent performance versus data safety trade-off. I’ll dig into how this manifests in Redis in the following subsections.

Performance

Redis is designed for low latency responses and high throughput. The primary data store is main memory, making for fast data object access. The limited collection of data structures and operations also make it possible for Redis to optimize requests and use space-efficient data object representations. As long as you can design your data model within the constraints of the Redis data types, you should see some very impressive performance.

Data safety

Redis trades off data safety for performance. In the default configuration, there is a 1-second window between AOF writes during which a crash can cause data loss. You can improve data safety by persisting the AOF on every write. Unfortunately, the performance hit of this configuration is substantial under heavy write loads.

Redis also uses a proprietary replication and primary election algorithm. A replica that is not up to date can be elected as leader, and hence data persisted at the previous leader may be lost.

The bottom line is that you probably don’t want to use Redis (or any in-memory database) as your primary data store if data loss is not an option.1 But if you can tolerate occasional data loss, Redis can provide very impressive throughput indeed.

Scalability

Redis Cluster is the primary scalability mechanism for Redis. It allows up to 1,000 nodes to host sharded databases distributed across 16,384 hash slots. Replicas for each primary can also serve read requests, enabling scaling of read workloads. If you need more than 1,000 primary nodes, then you must design your data store accordingly.

Consistency

Redis replication provides eventual consistency by default based on asynchronous replication. Stale reads from replicas are therefore possible. Using the WAIT command, the replication approach becomes effectively synchronous, as the primary does not respond to the client until the requested number of replicas have acknowledged the update. The trade-off of WAIT is longer latencies. In addition, it only guarantees data resides in memory in replicas. A replica crash before the next snapshot or AOF write could lead to the update being lost.

Availability

Redis Cluster implements a tried-and-tested primary-replica architecture for individual database shards. Write availability is inevitably impacted by leader failure. Writes will be unavailable for a given shard until a replica is promoted to leader.

Network faults can split a Redis Cluster deployment into majority and minority partitions. This has implications for both availability and data safety. Client writes can continue to all leader nodes in both partitions as long as they have at least one replica available. If a leader is split from its replicas in a minority partition, writes are still initially available for clients that also reside in the minority partition. After a timeout period, the partitioned leader will stop accepting writes as it cannot send updates to its replicas. Concurrently, a leader election will occur in the majority partition and a replica will be promoted to primary. When the partition heals, the write modifications made to the previous leader while partitioned will be lost.

MongoDB

MongoDB has been at the forefront of the NoSQL database movement since its first release in 2009. It directly addressed the well-known object-relational impedance mismatch by essentially harmonizing the database model with object models. The resulting document database can be best thought of as a JSON database. You can transform your business objects to JSON and store, query, and manipulate your data directly as a document. No elaborate object-relational mapper is needed. The result is intuitive and simpler business logic.

The initial popularity of MongoDB was driven by its ease of programming and use. The underlying storage engine in the early releases, known as MMAPv1,2 left something to be desired. MMAPv1 implements memory-mapped files using the mmap() system call. Documents in the same logical groupings, known as collections, are allocated contiguously on disk. This is great for sequential read performance. But if an object grows in size, new space has to be allocated and all document indexes updated. This can be a costly operation, and leads to disk fragmentation.

To minimize this cost, MMAPv1 initially allocates documents with additional space to accommodate growth. A solution indeed, but perhaps not the most space-efficient or scalable one. In addition, document locks for updates are obtained at very coarse-grained levels (e.g., depending on the release: server, database, or collection), causing less than spectacular write performance.

Around 2015, the development team reengineered MongoDB to support a pluggable storage engine architecture. Soon after, a new storage engine, WiredTiger, became default in MongoDB v3.2. WiredTiger addresses many of the shortcomings of MMAPv1.3 It introduces optimistic concurrency control and document-level locking, compression, operational journaling and checkpointing for crash recovery, and its own internal cache for improved performance.

Data Model and API

MongoDB documents are basically JSON objects with a set of extended types defined in the Binary JSON (BSON) specification. Documents are stored in BSON format and organized in databases comprising one or more collections. Collections are equivalent to a relational database table, but without a defined schema. This means MongoDB collections do not enforce a structure on documents. Documents with different structures can be stored in the same collection. This is a schemaless, or schema-on-read approach that requires the application to interpret a document structure on access.

MongoDB documents are composed of name-value pairs. The value of a field may be any BSON data type. Documents can also incorporate other documents, known as embedded or nested documents, and arrays of values or documents. Every document has an _id field which acts as the primary key. Applications can set this key value on document creation, or allow the MongoDB client to automatically allocate a unique value. You can also define secondary indexes on any field, subfield or on multiple fields—a compound key—in a collection.

An example document that you might find in a skier management system is shown in the code below. The field skiresorts is represented as an array of strings, and each different ski day is represented by an element in an array of nested documents:

{
    _id: 6788321471,
    name: { first: "Ian", last: "Gorton" },
    location: "USA-WA-Seattle",
    skiresorts: ["Crystal Mountain", "Mission Ridge"],
    numdays: 2,
    season21: [
          {
               day: 1,
               resort: "Crystal Mountain",
               vertical: 30701,
               lifts: 27,
               date: "12/1/2021"
          },
          {
               day: 2,
               resort: "Mission Ridge",
               vertical: 17021,
               lifts: 10,
               date: "12/8/2021"
          }
    ]
}

As there is no uniform document structure in a collection, the storage engine needs to persist field names and values for every document. For small documents, long field names may end up representing the majority of the document size. Shorter field names can reduce the size of the document on disk, and at scale, in a collection with many millions of documents, this saving will become significant.4 Optimized document sizes reduce disk usage, memory and cache consumption, and network bandwidth. As usual, at scale, small optimizations can pay back many times in minimizing resource utilization.

To manipulate documents, MongoDB provides APIs for basic CRUD operations. There is a .find() method with an extensive set of conditions and operators that emulate an SQL SELECT statement for documents in a single collection. MongoDB supports aggregate queries with the $match and $group operators, and the $lookup operator provides SQL JOIN-like behavior across collections in the same database. A simple example of querying a collection is shown in the following. The .find() operation returns all documents for skiers who have registered more than 20 ski days from the skiers2021 collection:

db.skiers2021.find({ numdays: { $gt: 20 } })
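
For instance, an aggregation over the same collection might group skiers by location and sum their ski days. This sketch uses the $match and $group operators described above:

db.skiers2021.aggregate([
    { $match: { numdays: { $gt: 0 } } },
    { $group: { _id: "$location", totalDays: { $sum: "$numdays" } } }
])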

Write operations to a single document in MongoDB are atomic. For this reason, if you denormalize your data model to make extensive use of nested documents, you can avoid the complexities of updating multiple documents and distributed transactions in your application code. Before MongoDB version 4.0, this was essentially the only way to ensure consistency for multidocument updates without complex application logic to handle failures.

Since version 4.0, support for ACID multidocument transactions has been implemented. MongoDB transactions use two-phase commit and leverage the underlying WiredTiger storage engine’s snapshot isolation capabilities. Snapshot isolation is a weaker guarantee than the serializability implied by ACID semantics. This enables higher performance and avoids most, but not all, of the concurrency anomalies that serializability prevents. Snapshot isolation is actually the default in many relational databases, including Oracle and PostgreSQL.
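
A minimal mongo shell sketch of a multidocument transaction follows; the database, collection, and field names are assumed:

const session = db.getMongo().startSession();
session.startTransaction();
try {
    const skiers = session.getDatabase("skidb").skiers2021;
    const passes = session.getDatabase("skidb").seasonpasses;
    skiers.updateOne({ _id: 6788321471 }, { $inc: { numdays: 1 } });
    passes.updateOne({ skierID: 6788321471 }, { $set: { lastvisit: "12/8/2021" } });
    session.commitTransaction();   // both updates become visible, or neither does
} catch (error) {
    session.abortTransaction();
} finally {
    session.endSession();
}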

Distribution and Replication

To scale horizontally, you can choose between two data partitioning or sharding strategies with MongoDB. These are hash-based and range-based sharding, respectively. You define a shard key for each document based on one or more field values. Upon document creation, MongoDB then chooses a database shard to store the document based on either:

  • The result of a hash function applied to the shard key

  • The shard that is defined to store the shard key range within which the key resides
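
For example, hash-based sharding could be enabled on a hypothetical skiers collection from the mongo shell as follows:

sh.enableSharding("skidb")
sh.shardCollection("skidb.skiers2021", { skierID: "hashed" })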

Sharded deployments in MongoDB require you to deploy several distinct database components. The mongod process is the MongoDB database daemon that must run on every shard. The mongos process is responsible for processing database client queries by routing requests to the targeted shard(s) and returning the results to the client. Clients issue MongoDB API calls using a MongoDB driver. Config servers store database cluster configuration metadata, which the mongos uses to route queries to the correct shards based on shard key values. This architecture is depicted in Figure 13-2.

Figure 13-2. MongoDB database partitioning architecture

The mongos process acts as a proxy between the client’s MongoDB driver and the database shards. All client requests must pass through a mongos instance. A mongos has no persistent state, and simply caches the cluster configuration information it obtains from the config servers.

The mongos process is the client’s only query interface. It is therefore critical for performance and scalability that sufficient mongos processing capacity is available. Precisely how you configure mongos deployments is highly dependent on your application’s needs, and MongoDB provides you with the flexibility to design your system to satisfy the required workload. There are three basic alternatives, as depicted in Figure 13-3:

Figure 13-3. MongoDB database deployment alternatives
Configuration (A)
Deploy a mongos on each application server that acts as a MongoDB client. This reduces latency by making every client request to mongos a local call.
Configuration (B)
Deploy a mongos on every database shard. In this configuration, a mongos can communicate with the shard locally.
Configuration (C)
Deploy a collection of mongos on their own dedicated hardware. You incur additional network latency communicating with the client and database shards. The trade-off is that the mongos load is eliminated from the application server and database nodes, and the mongos processes are allocated more exclusive processing capacity.

Within each shard, MongoDB stores documents in storage units known as chunks. By default a chunk is a maximum of 64 MB. When a chunk grows beyond its maximum configured size, MongoDB automatically splits the chunk into two or more new chunks. Chunk splitting is a metadata change, triggered by inserts or updates, and does not involve any data movement.

As the data grows across the cluster, the data distribution across shards can become unbalanced. This creates uneven loads on shards and can produce hotspots—shards that are heavily loaded with requests for commonly accessed keys. Hotspots impair query performance. For this reason, MongoDB runs a cluster balancer process on the primary config server. The cluster balancer monitors the data distribution across shards and if it detects that a (configurable) migration threshold has been reached, it triggers a chunk migration. Migration thresholds are based on the difference between the number of data chunks between the shard with the most chunks and the shard with the least chunks for a collection.

Chunk migration is initiated by the balancer. It sends a moveChunk command to the source shard. The source shard takes responsibility for copying the chunk to the destination. While migration is occurring, the source shard handles any updates to the chunk, and it ensures these updates are synchronized to the destination shard after the migration has completed. Finally, the source shard updates the cluster configuration metadata at the config server with the migrated chunk’s new location, and deletes its copy of the chunk.

MongoDB also supports enhanced availability and read query capacity through shard replication. Each primary shard can have multiple secondaries, and collectively these are known as a replica set. All client writes are processed by the primary, and it logs all changes to an operations log (oplog) data structure. Periodically, the primary ships its oplog to the secondaries, which in turn apply the modifications in the oplog to their local database copy. This approach is illustrated in Figure 13-4.

Figure 13-4. MongoDB replica set

Nodes in a replica set send periodic heartbeat messages, by default every two seconds, to confirm member availability. If a secondary node does not receive a heartbeat message from a primary in a (by default) 10-second period, it commences a leader election. The leader election algorithm is based on Raft. In addition, if a leader is partitioned in a minority partition, it will step down as leader. A new leader will subsequently be elected from the majority partition or when the partition heals. In either case, writes are not available to the replica set while the new leader is elected.

MongoDB supports tunable consistency. You can control replica consistency for writes using MongoDB write concerns. In version 5.0, the default is majority, which ensures writes are durable at the majority of nodes in a replica set before success is acknowledged. In earlier versions, the default setting only waited for the primary to make a write durable, trading off performance against data safety.
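
For example, in the mongo shell a write concern can be attached to an individual operation. Here, the insert is only acknowledged once a majority of replica set members have made it durable (collection and field names are assumed):

db.skiers2021.insertOne(
    { skierID: 6788321500, location: "USA-WA-Seattle" },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
)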

Similarly, read preferences make it possible to configure which nodes in a replica set may handle reads. By default, requests are sent to primaries, ensuring consistent reads. You can modify this to trade off read performance and consistency. For example, you can specify reads may be handled by any replica (see Figure 13-4) or the nearest replica as measured by shortest round-trip time. In either case, stale reads are possible. Reading from the nearest replica is especially useful in widely geographically distributed deployments. You can locate the primary in one data center and place replicas in other data center locations that are closer to the origins of client read requests. This reduces network latency costs for replica reads.
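
In the mongo shell, a read preference can be attached to a query cursor. The following sketch routes a stale-tolerant read to the member with the shortest round-trip time:

db.skiers2021.find({ location: "USA-WA-Seattle" }).readPref("nearest")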

Strengths and Weaknesses

MongoDB has matured massively since its initial releases. The attractive programming model drove initial popularity, and the core platform has been evolved by MongoDB engineers over more than a decade to improve performance, availability, scalability, and consistency. This has resulted in a powerful distributed database platform that applications can configure and tune to meet their requirements.

Performance

Initial MongoDB releases suffered from poor write performance. This has improved dramatically over the last decade, fueled to a large extent by the WiredTiger storage layer. Like most databases, each node’s performance benefits greatly from large local memory space allocated for internal caching. You can also choose read preferences and write concerns that favor raw performance over consistency if application requirements allow.

Data safety

The default majority write concern ensures updates are durable on a quorum of nodes in the replica set. You can achieve greater write performance by specifying that updates must only be made durable on the primary. This creates the potential for data loss if the primary crashes before updates are replicated. The Raft-based leader election algorithm ensures that only an up-to-date secondary can be promoted to leader, again guarding against data loss.

Scalability

You can scale data collections horizontally using sharding and by deploying multiple mongos query router processes. Automatic data rebalancing helps spread requests evenly across the cluster, utilizing cluster capacity. You can add new nodes and retire existing ones, and the MongoDB cluster balancer automatically moves chunks across the cluster to utilize the available capacity. You can scale read loads by enabling reads to secondaries in a replica set.

Consistency

The availability of ACID transactions across multiple sharded collections provides developers with transaction consistency capabilities. You can also achieve replica consistency using appropriate write concerns settings. Session-based causal consistency provides RYOWs capabilities. You can also ensure linearizable reads and writes for single documents. This requires a read concern setting of linearizable and a write concern value of majority.5

Availability

Replica sets are the primary mechanism to ensure data availability. You should configure config servers as a replica set to ensure the cluster metadata remains available in the face of node failures and partitions. Your configurations also need to deploy sufficient mongos query router processes, as clients cannot query the database if a mongos process is not reachable.

Amazon DynamoDB

Amazon’s DynamoDB is a core service offering in the AWS Cloud. Its origins go back to the original research published by Werner Vogels and his team on the Dynamo database.6 Dynamo was built for usage on Amazon’s website. Lessons learned internally, especially about the need for ease of management, led to the evolution of Dynamo to become the publicly available, fully managed DynamoDB database service in 2012.

As a fully managed database, DynamoDB minimizes the database administration effort required for applications. Replicated database partitions are automatically managed by DynamoDB, and data is repartitioned to satisfy size and performance requirements. Data items are hashed across partitions based on a user-defined partition key. Individual data items comprise nested, key-value pairs, and are replicated three times for data safety. The point-in-time recovery feature automatically performs incremental backups and stores them for a rolling 35-day period. Full backups can be run at any time with minimal effect on production systems.

As part of AWS, you are charged based on both the amount of storage used and the application’s DynamoDB usage. Storage charges are straightforward. You basically pay for each GB of data storage. Charging for application usage is more complex, and affects both performance and scalability. Basically you pay for every read and write you make to your database. You can choose between two modes, known as capacity modes. The on-demand capacity mode is intended for applications that experience unpredictable traffic profiles with rapid spikes and troughs. DynamoDB employs its adaptive capacity capabilities to attempt to ensure the database deployment is able to satisfy performance and scalability requirements. You are charged for every operation.

For applications with more predictable load profiles, you can choose provisioned capacity mode. You specify the number of reads and writes per second that your DynamoDB database should provide in terms of read and write capacity units. Should your application exceed this read/write capacity, requests may be throttled. DynamoDB provides burst capacity, based on recently unused provisioned capacity, to try to avoid throttling. You can also define a database to utilize autoscaling based on minimum and maximum provisioned capacity limits. Autoscaling dynamically adjusts the provisioned capacity on your behalf, within the specified limits, in response to observed traffic load.
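
As a sketch using the AWS Java SDK (the table and attribute names are assumed), provisioned capacity is specified when a table is created:

CreateTableRequest request = new CreateTableRequest()
    .withTableName("Skiers")
    .withAttributeDefinitions(new AttributeDefinition("skierID", ScalarAttributeType.S))
    .withKeySchema(new KeySchemaElement("skierID", KeyType.HASH))
    // 100 read capacity units and 50 write capacity units
    .withProvisionedThroughput(new ProvisionedThroughput(100L, 50L));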

DynamoDB has many optional features that either make it easier for you to write applications or provide your applications with higher levels of management automation. In general, the rule of thumb is that the more you ask DynamoDB to do for you, the more you pay. For example, if you enable point-in-time backups, then you pay per GB per month. If you disable this feature, you pay nothing. This is pretty much the way the world works with all cloud-based managed services. Caution is needed in how prolifically you use these options, especially at scale. But in most cases, your costs are reduced considerably due to the reduction in administrative and management effort.

Data Model and API

DynamoDB organizes data items in logical collections known as tables. Tables contain multiple items, which are uniquely identified by a primary key. Each item has a collection of uniquely identified attributes, which can optionally be nested. An individual item is restricted to 400 KB in size. DynamoDB is schemaless—items in the same table can have a different set of attributes.

In terms of data types, DynamoDB is fairly limited. Scalar types supported include strings, binary, numbers, and Booleans. You can build documents using list and map data types, and these can be nested up to 32 levels deep. You can also use sets to create a named attribute containing unique values. The code below depicts a DynamoDB item as an example. The primary key is skierID. The skiresorts field is represented by a list, and season21 is a map containing nested documents representing each of the skier’s visits to a resort:

{
 "skierID": "6788321471",
    "Name": {
          "last": "Gorton",
          "first": "Ian"
    },
 "location": "USA-WA-Seattle",
 "skiresorts": [
          "Crystal Mountain",
          "Mission Ridge"
 ],
 "numdays": 2,
 "season21": {
    "day1": {
          "date": "12/1/2021",
          "vertical": 30701,
          "lifts": 27,
          "resort": "Crystal Mountain"
    },
    "day2": {
          "date": "12/8/2021",
          "vertical": 17021,
          "lifts": 10,
          "resort": "Mission Ridge"
    }
 }
}

The primary key value for an item acts as the partition key, which is hashed to map each item to a distinct database partition. You can also create composite primary keys by defining a sort key using items in the table. This creates the ability to group logically related items in the same partition by using the same primary key and a unique sort key; DynamoDB still hashes the primary key to locate the partition, and it then stores all items with the same partition key value together, in sorted order by sort key value.7

As a simple example using the skier item in the code above, you could create a unique composite key using the location as the primary key and the skierID as the sort key. This would group together all skiers in the same location in the same partition, and store them in sorted order.
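
To make the pattern concrete, here is a minimal sketch of how you might query such a composite-key table using the AWS SDK for Java document API (the same API used in the GetItem example later in this section). The table name SkiersByLocation is hypothetical, and because location may collide with DynamoDB's reserved words, the sketch aliases it with an expression attribute name:

import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;
import com.amazonaws.services.dynamodbv2.document.utils.NameMap;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

// Hypothetical table: partition key "location", sort key "skierID"
Table table = dynamoDB.getTable("SkiersByLocation");

QuerySpec spec = new QuerySpec()
    .withKeyConditionExpression("#loc = :loc")
    .withNameMap(new NameMap().with("#loc", "location"))  // alias reserved word
    .withValueMap(new ValueMap().withString(":loc", "USA-WA-Seattle"));

// Items are returned in ascending sort key (skierID) order
ItemCollection<QueryOutcome> skiers = table.query(spec);
for (Item skier : skiers) {
    System.out.println(skier.toJSONPretty());
}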

To support alternative efficient query paths, you can create multiple secondary indexes on a table, referred to as the base table. There are two types of secondary indexes, local and global.

A local secondary index must have the same partition key as the base table, and a different sort key. Local indexes are built and maintained on the same partition as the items to which they refer. Local index reads and writes consume the capacity units allocated to the base table.

Global secondary indexes can have different primary and sort keys to the base table. This means index entries can span all partitions for the table, hence the global terminology. A global secondary index is created and maintained in its own partition, and requires capacity to be provisioned separately from the base table.

For data access, you have two choices of API in DynamoDB. The so-called classic API provides single- and multiple-item CRUD capabilities using variations of four core operations, namely PutItem, GetItem, DeleteItem, and UpdateItem. The following Java example shows a GetItem call. It retrieves the complete document identified by the skierID primary key value specified in the call:

Table table = dynamoDB.getTable("Skiers");
Item item = table.getItem("skierID", "6788321471");

If you want to read or write to multiple items at the same time, you can use the BatchGetItem and BatchWriteItem operations. These are essentially wrappers around individual GetItem and PutItem/DeleteItem/UpdateItem APIs. The advantage of using these batch versions is that all the requests are submitted in a single API call. This reduces the number of network round trips from your client to DynamoDB. Your performance also benefits because DynamoDB executes each individual read or write operation in parallel.
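
As an illustrative sketch, a single BatchGetItem call with the document API might fetch several skiers at once; the second skier ID here is invented for the example:

import com.amazonaws.services.dynamodbv2.document.BatchGetItemOutcome;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.TableKeysAndAttributes;

// Fetch several items from the Skiers table in one network round trip
TableKeysAndAttributes keys = new TableKeysAndAttributes("Skiers")
    .addHashOnlyPrimaryKeys("skierID", "6788321471", "6788321472");

BatchGetItemOutcome outcome = dynamoDB.batchGetItem(keys);
for (Item item : outcome.getTableItems().get("Skiers")) {
    System.out.println(item.toJSONPretty());
}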

The more recently available alternative API, known as PartiQL, is an SQL-derived dialect. You submit SQL statements using the ExecuteStatement and BatchExecuteStatement APIs. DynamoDB translates your SQL statements into individual API calls as defined in the classic API.
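
As a hedged sketch, the earlier GetItem lookup could be expressed in PartiQL as follows, using the low-level client's ExecuteStatement operation (initialization of the AmazonDynamoDB client is omitted for brevity):

import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ExecuteStatementRequest;
import com.amazonaws.services.dynamodbv2.model.ExecuteStatementResult;

// Positional parameters are bound in order to the ? placeholders
ExecuteStatementRequest request = new ExecuteStatementRequest()
    .withStatement("SELECT * FROM Skiers WHERE skierID = ?")
    .withParameters(new AttributeValue().withS("6788321471"));

ExecuteStatementResult result = client.executeStatement(request);
result.getItems().forEach(System.out::println);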

You also have ACID transaction capabilities using the ExecuteTransaction API. This enables you to group multiple CRUD operations to multiple items both within and across tables, with guarantees that all will succeed, or none will. Under the hood, DynamoDB uses the 2PC algorithm to coordinate transactions across distributed partitions.
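
The following is a sketch of a PartiQL transaction that atomically updates items in two tables; the Resorts table, its attributes, and the parameter values are invented for illustration:

import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ExecuteTransactionRequest;
import com.amazonaws.services.dynamodbv2.model.ParameterizedStatement;

// Both updates succeed together or fail together
ExecuteTransactionRequest txn = new ExecuteTransactionRequest()
    .withTransactStatements(
        new ParameterizedStatement()
            .withStatement("UPDATE Skiers SET numdays = ? WHERE skierID = ?")
            .withParameters(new AttributeValue().withS("3"),
                            new AttributeValue().withS("6788321471")),
        new ParameterizedStatement()
            .withStatement("UPDATE Resorts SET visits = ? WHERE resortID = ?")
            .withParameters(new AttributeValue().withN("42"),
                            new AttributeValue().withS("crystal-mountain")));

client.executeTransaction(txn);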

Transactions have an impact on capacity provisioning. In provisioned mode, each transaction will incur two reads or writes to each data item accessed in the transaction. This means you have to plan your read and write capacity units accordingly. If sufficient provisioned capacity is not available for any of the tables accessed in the transaction, the transaction may fail.

Distribution and Replication

As a managed service, DynamoDB simplifies data distribution and replication from the application’s perspective. You define a partition key for items, and DynamoDB hashes the key to store three copies of every item. To enhance availability, the nodes that host each partition are in different availability zones within a single AWS region. Availability zones are designed to fail independently of others within each AWS region.

Each partition has a leader and two followers. When you issue an update request to an item, you receive an HTTP 200 response code when the update is made durable on the leader. Updates then propagate asynchronously to replicas.

By default, read operations can access any replica, leading to the potential for stale reads. If you want to ensure you read the latest value of an item, you can set the ConsistentRead parameter in read APIs to true. This directs reads to the leader node, which has the latest value. Strongly consistent reads consume more capacity units than eventually consistent reads, and may fail if the leader partition is unavailable.
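
With the document API from the earlier GetItem example, requesting a strongly consistent read is a one-parameter change, as this minimal sketch using the SDK's GetItemSpec class shows:

import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.spec.GetItemSpec;

// Route the read to the leader replica for the latest committed value
GetItemSpec spec = new GetItemSpec()
    .withPrimaryKey("skierID", "6788321471")
    .withConsistentRead(true);  // consumes more read capacity units
Item item = dynamoDB.getTable("Skiers").getItem(spec);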

DynamoDB manages your partitions, and its adaptive capacity capabilities will automatically repartition data, while maintaining availability, under the following circumstances:

  • A partition exceeds the size limits for partitions, which is approximately 10 GB.

  • You increase the provisioned throughput capacity for a table, requiring performance that is higher than the existing partitions can support.

  • A table configured to use on-demand capacity experiences a spike in requests that exceeds the throughput it is able to sustain.

By default, DynamoDB tables reside in a single AWS region. AWS regions are tied to physical resources known as data centers that are located in different places around the world. For applications that serve large-scale, globally distributed user populations, latencies can be potentially prohibitive if requests must travel long distances to the region where your DynamoDB database resides.

As an example, imagine the skier management system from earlier in this chapter has ski resorts all over the globe, and uses a DynamoDB database located in the US west coast region (e.g., us-west-1). Skiers at European and Australian resorts would experience considerably longer latencies to access the system than those located in North America.

You can address these latencies by deploying your tables across multiple regions using DynamoDB global tables. Global tables maintain additional replicas in multiple AWS regions, and replicate all items across all the regions you wish to locate the table. Updates made in one region propagate to other replicas asynchronously. You also pay storage charges at each region, increasing the overall application costs. This scheme is shown in Figure 13-5, with global tables located in the US, India, and Italy.

Figure 13-5. DynamoDB global tables

Importantly, global tables are multileader, meaning you can update the leader replica in any region. This creates the potential for conflicts if the same item is concurrently updated in two regions. In this case, DynamoDB uses a last writer wins conflict resolution strategy to converge replicas on a single value.

Global tables have some subtle restrictions you need to be aware of. These concern strongly consistent reads and transactions, which both operate at the scope of a single region:

  • A strongly consistent read returns the latest value for an item within the region that the read takes place. If the same item key has been more recently updated in another region, this value will not be returned. It may take several seconds for the latest version to be replicated across regions.

  • The ACID properties of transactions are only guaranteed within the region that processes the transaction. Once the transaction has been committed in this source region, DynamoDB replicates the resulting updates to the other regions. The updates flow using the standard replication protocol, meaning you may see partial updates in destination regions while all the updates from the transaction are applied.

Strengths and Weaknesses

It’s not easy to divorce the increasing popularity of DynamoDB from the ever-growing usage of the AWS Cloud. DynamoDB exists as part of the powerful AWS ecosystem of tools and technologies. The benefits of this can be considerable. For example, AWS provides integrated performance monitoring for DynamoDB using CloudWatch, and integrates seamlessly with AWS Lambda serverless functions. If you are deploying your systems to AWS, DynamoDB can be an excellent candidate for your persistence layer. Like any database of course, there are things you need to carefully assess. And as always with public cloud-based systems, you have to be aware of the costs your applications accrue.

Performance

The DynamoDB APIs are relatively primitive and hence can generally be executed with very low latencies. Your data model can also exploit composite keys and secondary indexes to provide efficient access to your data. Queries that exploit indexes rather than performing table scans will execute faster and consume fewer capacity units, which also reduces costs. Crafting an appropriate data model that supports low-latency queries is undoubtedly not a straightforward exercise8 and requires care to achieve performance requirements. At additional cost, you can deploy the DynamoDB Accelerator (DAX) in-memory cache to further reduce query latencies.

Data safety

Updates are acknowledged when the leader partition makes the modification durable, and all items in tables are replicated across three partitions in the local region. Using global tables increases the replication factor, but does introduce the potential for data loss if the same item is concurrently updated in two different regions. Point-in-time and on-demand backups are fully integrated with the AWS environment.

Scalability

DynamoDB’s adaptive capacity is designed to rebalance large databases to provide sufficient partitions to match observed demand. This provides excellent scalability for workloads that exert relatively even loads across partitions.

A well-known problem revolves around hotkeys. Provisioned capacity is allocated on a per-table basis. This means if your application has 10 partitions, each partition receives a tenth of the overall table capacity. If requests disproportionately access a small number of hot keys, the partitions that host those items can consume the provisioned capacity for the table. This can cause requests to be rejected due to a lack of provisioned capacity.

Adaptive capacity in extreme cases may create a partition that holds a single item with a hotkey. In this case, requests to the item are limited to the maximum throughput a single partition can deliver of 3,000 read capacity units or 1,000 write capacity units per second.

Consistency

Replicas are eventually consistent, so stale reads from nonleader replicas are possible. You can obtain the latest replica value using strongly consistent reads at the cost of additional capacity unit usage and latency. Reads from global indexes are always eventually consistent. You can also use ACID transactions to perform multi-item updates.9 Both strongly consistent reads and transactions are scoped to a region and hence do not provide consistency guarantees with global tables.

Availability

DynamoDB provides users with a service-level agreement (SLA). This basically guarantees 99.999% availability for global tables and 99.99% availability for single-region tables. AWS outages do occur occasionally; for example, a major one brought down many applications in December 2021, and it's possible that a failure in part of the AWS ecosystem could make your data unavailable. It's basically a risk you take when you adopt a cloud-based service, and the reason that deployment strategies like hybrid and multicloud are becoming more and more popular.

Summary and Further Reading

In this chapter, I’ve described some of the major architectural features of three prominent NoSQL databases, namely Redis, MongoDB, and DynamoDB. Each is a powerful distributed platform in its own right, with large user communities. Underneath the hood, the implementations vary considerably. This affects the performance, scalability, availability, and consistency you can expect from applications built on each platform.

Redis favors raw performance and simplicity over data safety and consistency. MongoDB has a richer feature set and is suited to a broad range of business applications that require future growth. DynamoDB is a fully managed service and supports low-latency key-value lookups. It is deeply integrated into the AWS Cloud infrastructure, providing automatic scalability and availability guarantees. Similarly, you can use cloud-hosted implementations of both MongoDB and Redis (and several other databases) that are supported by major cloud vendors to simplify your operations and management.

In reality, there’s no perfect solution or approach for choosing a distributed database to match your application needs. There are simply too many dimensions and features to thoroughly evaluate even for a small number of candidate platforms. The best you can do most of the time is serious due diligence, and ideally build a proof-of-technology prototype that lets you test-drive one or two platforms. There will always be unexpected roadblocks that make you curse your chosen platform. Software engineering at scale is an imperfect practice, I’m afraid, but with deep knowledge of the issues involved, you can usually avoid most disasters!

For a book with excellent coverage (both breadth and depth) of distributed database systems, Principles of Distributed Database Systems, 4th ed. (Springer, 2020) by M. Tamer Özsu and Patrick Valduriez is one to have on your bookshelf.

An excellent place for gaining insights into how some of the largest systems on the internet operate is highscalability.com. For example, recent posts describe the design of Tinder, which uses DynamoDB among a whole collection of technologies, and Instagram, built upon Cassandra and Neo4j.

Finally, the complexity of managing distributed databases at scale is driving many businesses to use managed services such as DynamoDB. Platforms providing equivalent “serverless database” capabilities are emerging for many popular databases. Examples are MongoDB Atlas, Astra DB for Cassandra, and Yugabyte Cloud.

1 Emil Koutanov goes into more detail with an excellent analysis of data safety in Redis.

2 MMAPv1 was deprecated in MongoDB version 4.0. You can find its documentation at https://oreil.ly/uWiNx.

3 A good comparison of the two file systems can be found on the Percona blog.

4 This blog post by David Murphy illustrates how a 25% reduction in document size can be achieved with shorter field names.

5 A slightly out-of-date but still fascinating analysis of MongoDB consistency is provided by this Jepsen report.

6 G. Decandia et al. “Dynamo: Amazon’s Highly Available Key-Value Store.” In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles—SOSP ’07, p. 205. New York, NY, USA: ACM.

7 For more examples of the power of sort keys, see https://oreil.ly/5G5le.

8 Best practices for data modeling are described in the AWS documentation.

9 An explanation of transaction isolation levels is at https://oreil.ly/kDRnC.

Part IV. Event and Stream Processing

Part IV switches gears and describes architectures and technologies for processing streaming events at scale. Event-based systems pose their own unique challenges. They require technologies for reliably and efficiently capturing and persisting high-volume event streams. You also need tools to support calculating partial results from the most recent snapshots of the event stream (think trending topics in Twitter), with real-time capabilities and tolerance of processing node failures. I’ll explain the architectural approaches required and illustrate solutions using the widely deployed Apache Kafka and Flink open source technologies.

Chapter 14. Scalable Event-Driven Processing

In Chapter 7, I described the benefits and basic primitives of asynchronous messaging systems. By utilizing a messaging system for communications, you can create loosely coupled architectures. Message producers simply store a message on a queue, without concern about how it is processed by consumers. There can be one or many consumers, and the collection of producers and consumers can evolve over time. This buys you immense architectural flexibility and has benefits in improving service responsiveness, smoothing out request arrival spikes through buffering, and maintaining system processing in the face of unavailable consumers.

Traditionally, the message broker technologies used to implement asynchronous systems focus on message transit. A broker platform such as RabbitMQ or ActiveMQ supports collections of queues that are used as temporary FIFO-based memory or disk-based storage. When a consumer accesses a message from a queue, the message is removed from the broker. This is known as destructive consumer semantics. If publish-subscribe messaging is used, brokers implement mechanisms to maintain messages in queues until all active subscribers have consumed each message. New subscribers do not see old messages. Brokers also typically implement some additional features for message filtering and routing.

In this chapter I’m going to revisit asynchronous systems through the lens of event-driven architectures. Event-driven systems have some attractive features for scalable distributed applications. I’ll briefly explain these attractions, and then focus on the Apache Kafka platform. Kafka is designed to support event-driven systems at scale, utilizing a simple persistent message log data structure and nondestructive consumer semantics.

Event-Driven Architectures

Events represent that something interesting has happened in the application context. This might be an external event that is captured by the system, or an internally generated event due to some state change. For example, in a package shipping application, when a package arrives at a new location, a barcode scan generates an event containing the package identifier, location, and time. A microservice in a car hire system that manages driver details could emit an event when it detects a driver’s license has expired. Both these examples demonstrate using events for notifications. The event source simply emits the event and has no expectations on how the event might be processed by other components in the system.

Events are typically published to a messaging system. Interested parties can register to receive events and process them accordingly. A package shipping barcode scan might be consumed by a microservice that sends a text to the customer awaiting the package. Another microservice might update the package’s delivery state, noting its current location. The expired license event may be utilized to send the driver an email to remind them to update their information. The important thing is that the event source is oblivious to the actions that are triggered by event generation. The resulting architecture is loosely coupled and affords high levels of flexibility for incorporating new consumers of events.

You can implement an event-based architecture using messaging systems like RabbitMQ’s publish/subscribe features. Once every subscriber has consumed an event, the event is removed from the broker. This frees up broker resources, but also has the effect of destroying any explicit record of the event.

It turns out that keeping a permanent record of immutable events in a simple log data structure has some useful characteristics. In contrast to FIFO queues managed by most message brokers, an event log is an append-only data structure, as shown in Figure 14-1. Records are appended to the end of the log and each log entry has a unique entry number. The sequence numbers explicitly capture the order of events in the system. Events with a lower sequence number are defined to have occurred before entries with a higher sequence number. This order is especially useful in distributed systems and can be exploited to produce useful application insights and behaviors.

Figure 14-1. The log data structure

For example, in the package shipping example, you could process the log to discover the number of packages at each location at any instant, and the duration that packages reside at locations before being loaded onto the next stage of delivery. If a package gets misplaced or delayed, you can generate another event to trigger some remedial action to get a package moving again. These analyses become straightforward to implement as the log is the single source of truth about where every package is (and was) at any instant.

Another common use case for event-based systems is keeping replicated data synchronized across microservices. For example, a manufacturer might change the name of a product by sending an update request to the Catalog microservice. Internally, this microservice updates the product name in its local data store and emits an event to an event log shared with other microservices in the application. Any microservice that stores product details can read the event and update its own copy of the product name. As shown in Figure 14-2, the event log is essentially being used for replication across microservices to implement state transfer.

Figure 14-2. Using an event log to replicate state changes across microservices

The persistent nature of the event log has some key advantages:

  • You can introduce new event consumers at any time. The log stores a permanent, immutable record of events and a new consumer has access to this complete history of events. It can process both existing and new events.

  • You can modify existing event-processing logic, either to add new features or fix bugs. You can then execute the new logic on the complete log to enrich results or fix errors.

  • If a server or disk failure occurs, you can restore the last known state and replay events from the log to restore the data set. This is analogous to the role of the transaction log in database systems.

As with all things, there are downsides to immutable, append-only logs. I briefly describe one of these, deleting events, and Apache Kafka’s related capabilities in the following sidebar. You can read an awful lot more about designing event-driven architectures and patterns such as event collaboration and event sourcing. I’ll point you to several excellent sources in “Summary and Further Reading”. For the remainder of this chapter, however, I want to explore the features of the Apache Kafka platform.

Apache Kafka

At its core, Kafka is a distributed persistent log store. Kafka employs what is often called a dumb broker/smart clients architecture. The broker’s main capabilities revolve around efficiently appending new events to persistent logs, delivering events to consumers, and managing log partitioning and replication for scalability and availability. Log entries are stored durably and can be read multiple times by multiple consumers. Consumers simply specify the log offset, or index, of the entries they wish to read. This frees the broker from maintaining any complex consumer-related state.

The resulting architecture has proven to be incredibly scalable and to provide very high throughput. For these reasons, Kafka has become one of the most widely used open source messaging platforms in use in modern systems.

Kafka originated at LinkedIn from efforts to streamline their system integration efforts.1 It migrated to become an Apache project in 2012. The Kafka broker, which is the focus of the following subsections, sits at the core of a suite of related technologies. These are:

Kafka Connect
This is a framework designed for building connectors to link external data systems to the Kafka broker. You can use the framework to build high-performance connectors that produce or consume Kafka messages from your own systems. Multiple vendors also provide prefabricated connectors for pretty much any data management system most of you can probably think of!2
Kafka Streams
This is a lightweight client library for building streaming applications from events stored in the Kafka broker. A data stream represents an unbounded, continuously updating data set. Streaming applications provide useful real-time insights by processing data in batches or time windows. For example, a supermarket may process a stream of incoming item purchases to discover the highest selling items in the last hour. This could be used to trigger reordering or restocking of items that are unexpectedly selling quickly. Streaming applications and platforms are the topic I cover in depth in Chapter 15, so I won’t return to Kafka Streams here.

Kafka supports highly distributed cluster deployments in which brokers communicate to distribute and replicate event logs. This requires management of cluster metadata, which essentially specifies where the multiple event logs live in the cluster, and various other elements of cluster state. Kafka delegates this metadata management to Apache ZooKeeper.

ZooKeeper is a highly available service that is used by many distributed platforms to manage configuration information and support group coordination. ZooKeeper provides a hierarchical namespace similar to a normal filesystem that Kafka uses to maintain the cluster state externally, making it available to all brokers. This means you must create a ZooKeeper cluster (for availability) and make this accessible to the brokers in your Kafka cluster.3 After that, Kafka’s use of ZooKeeper is transparent to your application.

Topics

Kafka topics are the equivalent of queues in general messaging technologies. In Kafka, topics are managed by a broker and are always persistent, or durable. One or more producers send events to a topic. Topics are implemented as append-only logs, meaning new events are always written to the end of the log. Consumers read events by specifying the name of the topic they wish to access and the index, or offset, of the message they want to read.

Reading an event from a topic is nondestructive. Each topic persists all events until a topic-specific configurable event retention period expires. When events have been stored for longer than this retention period, they are automatically removed from the topic.
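
As an illustration, a retention period can be set per topic at creation time. The following sketch uses Kafka's Admin client; the topic name, partition and replica counts, and the seven-day retention value are assumptions, while retention.ms is the standard per-topic configuration key:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

Properties adminProps = new Properties();
adminProps.put("bootstrap.servers", "IPbroker1,IPBroker2"); // placeholders

try (Admin admin = Admin.create(adminProps)) {
    // Retain events in this topic for 7 days (value in milliseconds)
    NewTopic topic = new NewTopic("CrystalMountainTopic", 1, (short) 1)
        .configs(Map.of("retention.ms", "604800000"));
    admin.createTopics(Collections.singleton(topic)).all().get();
} catch (Exception e) {
    e.printStackTrace();
}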

Brokers take advantage of the append-only nature of logs to exploit the linear read and write performance capabilities of disks. Operating systems are heavily optimized for these data access patterns, and use techniques such as prefetching and caching of data. This enables Kafka to provide constant access times regardless of the number of events stored in a topic.

Returning to the skier management system example from Chapter 13, Figure 14-3 shows a Kafka broker that supports three topics used to capture ski lift ride events from three different ski resorts. Each time a skier rides a lift, an event is generated and written to the corresponding topic for that resort by a Kafka producer. Consumers can read events from the topic to update the skier’s profile, send alerts for high-traffic lifts, and various other useful analytical functions related to the ski resort management business.

Figure 14-3. A single Kafka broker managing topics for three ski resorts

Producers and Consumers

Kafka provides APIs for both producers to write events and consumers to read events from a topic. An event has an application-defined key and an associated value, and a publisher-supplied timestamp. For a lift ride event, the key might be the skierID and the value would embed the skiLiftID and a timestamp for when the skier rode the lift. The publisher would then send the event to the topic for the appropriate resort.

Kafka producers send events to brokers asynchronously. Calling the producer.send() operation causes the event to be written to a local buffer in the producer. Producers create batches of pending events until one of a configurable pair of parameters is triggered. The whole event batch is then sent in one network request. You can, for example, use these parameters to send the batch to the broker as soon as the batch size exceeds a specified value (e.g., 256 K) or some latency bound (e.g., 5 ms) expires. This is illustrated in Figure 14-4 along with how to set these configuration parameter values using a Properties object. Producers build independent batches in local buffers for each topic they deliver events to. Batches are maintained in the buffer until they are successfully acknowledged by the broker.

Figure 14-4. Kafka producers

Accumulating events in batches enables Kafka to incur less network round trips to the broker to deliver events. It also enables the broker to perform fewer, larger writes when appending event batches to the topic. Together, these efficiency measures are responsible for much of the high throughput that a Kafka system can achieve. Buffering events on producers allows you to trade off the additional latency that is incurred while batches are accumulated (the linger.ms value) for improved system throughput.
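
As a concrete sketch, these two batching parameters are set alongside the other producer properties. The broker addresses are placeholders and the serializer choices are assumptions for simple string-keyed events:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "IPbroker1,IPBroker2"); // placeholder addresses
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());
props.put("batch.size", 262144); // send once a batch reaches 256 KB...
props.put("linger.ms", 5);       // ...or after waiting at most 5 ms

KafkaProducer<String, String> producer = new KafkaProducer<>(props);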

The following code snippet shows a simple method that sends a ski lift ride event to a topic that represents the resort on the broker. The send() method returns a Future of type RecordMetadata. Calls to Future.get() will block until the broker has appended the event to the topic and returns a RecordMetadata object. This contains information about the event in the log such as its timestamp and offset:

import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public Future<RecordMetadata> sendToBroker(final String skierID,
                                           final String liftRideEvent) {
       // initialization of producer and resortTopic omitted for brevity
       final ProducerRecord<String, String> producerRecord =
           new ProducerRecord<>(resortTopic, skierID, liftRideEvent);
       return producer.send(producerRecord);
}

Kafka supports different event delivery guarantees for producers through the acks configuration parameter. A value of zero provides no delivery guarantee. This is a “fire-and-forget” option—events can be lost. A value of one means an event will be acknowledged by the broker once it has been persisted to the destination topic. Transient network failures may cause the producer to retry failed events, leading to duplicates. If you can’t accept duplicates, you can set the enable-idempotence configuration parameter to true. This causes the broker to filter out duplicate events and provide exactly-once delivery semantics.
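
Extending the producer configuration sketch above, the strongest settings look like the following. Note that in the Kafka client configuration the property key is spelled enable.idempotence, and enabling it requires acks to be set to all:

// The weaker alternatives are acks=0 (fire-and-forget) and acks=1
// (acknowledged once the leader has persisted the event)
props.put("acks", "all");                // wait for the in-sync replicas
props.put("enable.idempotence", "true"); // broker filters retry duplicates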

Kafka consumers utilize the pull model to retrieve events in batches from a topic. When a consumer first subscribes to a topic, its offset is set to the first event in the log. You then call the poll() method of the consumer object in an event loop. The poll() method returns one or more events starting from the current offset. Similarly to producers, you can tune consumer throughput using configuration parameters that specify how long a consumer waits for events to be available and the number of events returned on each call to poll().

The following simple consumer code example shows an event loop that retrieves and processes a batch of events:

while (alive) {
  ConsumerRecords<K, V> liftRideEvents = consumer.poll(LIFT_TOPIC_TIMEOUT);
  analyze(liftRideEvents); 
  consumer.commitSync();
}

Kafka increments the consumer’s offset in the topic automatically to point to the next unprocessed event in the topic. By default Kafka will automatically commit this value such that the next request to fetch events will commence at the new offset. The commit message is actually sent as part of the poll() method, and this commits the offset returned by the previous poll() request. Should your consumer fail while processing the batch of events, the offset is not committed as poll() is not called. This gives your consumer at-least-once delivery guarantees, as the next fetch will start at the same offset as the previous one.

You can also choose to manually commit the offset in consumers. You do this by calling the consumer.commitSync() API, as shown in the example. If you call commitSync() before you process the events in a batch, the new offset will be committed. This means if the consumer fails while processing the event batch, the batch will not be redelivered. Your consumers now have at-most-once delivery guarantees.

Calling commitSync() after you have processed all the events in a batch, as in the example, gives your consumers at-least-once delivery guarantees. If your consumer crashes while processing a batch of events, the offset will not be committed and when the consumer restarts the events will be redelivered. Consumers can also at any time explicitly set the offset for the topic using the consumer.seek(topic, offset) API.
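
For completeness, here is a sketch of the consumer setup these loop examples assume, with automatic commits disabled so that the commitSync() calls control the delivery guarantee. The group ID and topic name are illustrative:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "IPbroker1,IPBroker2"); // placeholder addresses
props.put("group.id", "lift-ride-analyzers");          // illustrative group ID
props.put("enable.auto.commit", "false");              // commit manually
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("WhitePassTopic"));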

Note the Kafka consumer API is not thread safe. All network interactions with the broker occur in the same client thread that retrieves events. To process events concurrently, the consumer needs to implement a threading scheme. A common approach is a thread-per-consumer model, which provides a simple solution at the cost of managing more TCP connections and fetch requests at the broker. An alternative is to have a single thread fetch events and offload event processing to a pool of processing threads. This potentially provides greater scalability, but makes manually committing events more complex as the threads somehow need to coordinate to ensure all events are processed for a topic before a commit is issued.

Scalability

The primary scalability mechanism in Kafka is topic partitioning. When you create a topic, you specify the number of partitions that should be used for storing events and Kafka distributes partitions across the brokers in a cluster. This provides horizontal scalability, as producers and consumers respectively can write to and read from different partitions in parallel.

When a producer starts, you specify a list of host/port pairs to connect to the cluster using the Properties object, as shown in the following Java snippet:

Properties props = new Properties();
props.put("bootstrap.servers", "IPbroker1,IPBroker2");

The producer connects to these servers to discover the cluster configuration in terms of broker IP addresses and which partitions are allocated to which brokers.

In tune with the “dumb broker” architecture that Kafka implements, producers, not the broker, are responsible for choosing the partition that an event is allocated to. This enables the broker to focus on its primary purpose of receiving, storing, and delivering events. By default, your producers use the DefaultPartitioner class provided by the Kafka API.

If you do not specify an event key (i.e., the key is null), the DefaultPartitioner sends batches of messages to topic partitions in a round-robin fashion. When you specify an event key, the partitioner uses a hash function on the key value to choose a partition. This directs events with the same key to the same partition, which can be useful for consumers that process events in aggregates. For example, in the ski resort system, you could use a liftID as a key to ensure all lift ride events on the same lift at the same resort are sent to the same partition. Or you could use skierID to ensure all lift rides for the same skier are sent to the same partition. This is commonly called semantic partitioning.

Partitioning a topic has an implication for event ordering. Kafka will write events to a single partition in the order they are generated by a producer, and events will be consumed from the partition in the order they are written. This means events in each partition are ordered by time, and provide a partial ordering of the event stream.4

However, there is no total order of events across partitions. You have to design your applications to be cognizant of this restriction. In Figure 14-5, consumers will see lift ride events for each lift hashed to a partition in order, but determining the lift ride event order across partitions is not possible.

Figure 14-5. Using hashing to distribute events across topic partitions

You can also increase—but not decrease—the number of topic partitions after initial deployment. Existing events in the partitions remain in place, but new events with the same keys may potentially be hashed to a different partition. In the example, lift rides with the key value liftID = 2 could suddenly be hashed to a different partition. You must therefore design your consumers so that they do not expect to process the same set of key values indefinitely from a partition.5

Partitions also enable concurrent event delivery to multiple consumers. To achieve this, Kafka introduces the concept of consumer groups for a topic. A consumer group comprises one or more consumers for a topic, up to a maximum of the number of partitions configured for a topic. There are basically three consumer allocation alternatives depending on the number of topic partitions and the number of subscribers in the group:

  • If the number of consumers in the group is equal to the number of partitions, Kafka allocates each consumer in the group to exactly one partition.

  • If the number of consumers in the group is less than the number of partitions, some consumers will be allocated to consume messages from multiple partitions.

  • If the number of consumers in the group exceeds the number of partitions, some consumers will not be allocated a partition and remain idle.

Figure 14-6 illustrates these allocation possibilities when (a) the consumer group size is equal to the number of partitions and (b) the consumer group size is less than the number of partitions.

Figure 14-6. Kafka consumer groups where (a) group size = number of partitions and (b) group size < number of partitions

Kafka implements a rebalancing mechanism for consumer groups.6 This is triggered when a new consumer joins or an existing consumer leaves the group, or new partitions are added to a topic. For each consumer group, Kafka allocates one broker as the group coordinator. The coordinator tracks the partitions of topics and the members and subscriptions in the consumer group. If the number of topic partitions or group membership changes, the coordinator commences a rebalance. The rebalance must ensure that all topic partitions are allocated to a consumer from the group and all consumer group members are allocated one or more partitions.

To perform a rebalance, Kafka chooses one consumer from a group chosen as the group leader. When the rebalance is invoked, the group coordinator on the broker informs the consumer group leader of the existing partition assignments to the group members and the configuration changes needed. The consumer group leader decides how to allocate new partitions and group members, and may need to reassign existing partitions across group members. Moving a partition between consumers requires the current owner to first relinquish its subscription. To trigger this change, the group leader simply removes these subscriptions from the consumer’s allocations and sends the new partition assignments to each consumer.

Each consumer processes the new allocation from the leader:

  • For partitions that are not moved between consumers, event processing can continue with no downtime.

  • New partitions that are allocated to the consumer are simply added.

  • For any of the consumer’s existing partitions that do not appear in their new allocation, consumers complete processing the current batch of messages, commit the offset, and relinquish their subscription.

Once a consumer relinquishes a subscription, that partition is marked as unassigned. A second round of rebalancing then proceeds to allocate the unassigned partitions, ensuring each partition is assigned to a member of the group. Figure 14-7 shows how the rebalancing occurs when you add a consumer to a group.

Figure 14-7. Kafka partition rebalancing when a new consumer is added to a group

In reality, most rebalances require very few partition reassignments. Kafka’s rebalancing approach exploits this fact and enables consumers to keep processing messages while the rebalance proceeds. The group coordinator on the broker also has minimal involvement, basically just orchestrating the rebalances. The group leader is responsible for making partition reassignments. This simplifies the broker—dumb broker architecture, remember—and makes it possible to inject custom partition allocation algorithms for groups through a pluggable client framework. Kafka provides a CooperativeStickyAssignor out of the box, which maintains as many existing partition assignments as possible during a rebalance.

Availability

When you create a topic in Kafka, you can specify a replication factor of N. This causes Kafka to replicate every partition in the topic N times using a leader-follower architecture. Kafka attempts to allocate leaders to different brokers and deploy replicas to different broker instances to provide crash resilience. An example of a replicated partition for the skier management system topics with N = 3 is shown in Figure 14-8.

Figure 14-8. Kafka topic replication

Producers and consumers always write and read from the leader partitions, as shown just for the WhitePassTopic in Figure 14-8. Followers also behave as consumers from their associated leader, fetching messages at a period specified by the replica.fetch.wait.max.ms configuration parameter (default 500 ms).

If a leader fails, Kafka can automatically failover to one of the followers so that the partition remains available. The leader broker dynamically maintains a list of replicas that are up to date with the leader. This list, known as the in-sync replica (ISR) list, is persisted in ZooKeeper so that it is available in the event of leader failure. Kafka’s custom leader election algorithm ensures that only members of the ISR can become leaders.

In a replicated deployment, producers can specify acks=all for data safety when publishing events. With this setting, the leader will not acknowledge a batch of events until they have been persisted by all ISRs. A topic can specify the minimum ISRs (min.insync.replicas) required to acknowledge a successful write. If the number of ISRs falls below this value, writes will fail. For example, you can create a topic with a replication factor of 3, and set min.insync.replicas to 2. Send operations will succeed as long as the majority, namely the leader and one follower, have received the write. Applications can therefore trade off data safety and latency versus availability by tuning the minimum ISRs value to meet requirements.
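
Tying these settings together, the following sketch creates a topic like the one in Figure 14-8 with three partitions, a replication factor of 3, and min.insync.replicas set to 2; producers would then specify acks=all to obtain the majority-write behavior described above:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

Properties adminProps = new Properties();
adminProps.put("bootstrap.servers", "IPbroker1,IPBroker2"); // placeholders

try (Admin admin = Admin.create(adminProps)) {
    // 3 partitions, 3 replicas; a write needs the leader plus one follower
    NewTopic topic = new NewTopic("WhitePassTopic", 3, (short) 3)
        .configs(Map.of("min.insync.replicas", "2"));
    admin.createTopics(Collections.singleton(topic)).all().get();
} catch (Exception e) {
    e.printStackTrace();
}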

Summary and Further Reading

Event-driven architectures are suitable for many use cases in the modern business landscape. You can use events to capture external activities and stream these into analytical systems to give real-time insights into user and system behaviors. You can also use events to describe state changes that are published to support integration across disparate systems or coupled microservices.

Event-processing systems require a reliable, robust, and scalable platform to capture and disseminate events. In this chapter, I’ve focused on Apache Kafka because it has been widely adopted in recent years and is suitable for high-throughput, scalable application deployments. In contrast to most messaging systems, Kafka persists events in topics that are processed in a nondestructive manner by consumers. You can partition and replicate topics to provide greater scalability and availability.

There’s no better source of Kafka knowledge than Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale, 2nd ed., by Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty (O’Reilly, 2021). For more general information on event-based architectures, Adam Bellemare’s Building Event-Driven Microservices: Leveraging Organizational Data at Scale (O’Reilly, 2020) is full of insights and wisdom.

Kafka 是一个高度可配置的平台。这既可能是祝福,也可能是诅咒。通过更改各种配置参数,您可以调整吞吐量、可扩展性、数据安全性、保留策略和主题大小。但是,由于可供使用的相互依赖的参数如此之多,最佳方法并不总是显而易见。这就是为什么我建议查看一些针对 Kafka 性能和调优开展的研究。下面的列表都是非常有趣的读物,可以帮助指导您调整 Kafka 的行为以满足您的需求:

Kafka is a highly configurable platform. This can be both a blessing and a curse. By changing various configuration parameters, you can tune throughput, scalability, data safety, retention, and topic size. But with so many interdependent parameters at your disposal, the best approach is not always obvious. This is why I recommend looking at some of the studies that have been conducted on Kafka performance and tuning. The list below contains really interesting reads that can help guide you in tuning Kafka's behavior to meet your needs:

1 Jay Kreps,Kafka 的发明者之一,写了这篇优秀的文章,详细介绍了日志和项目的开发。

1 Jay Kreps, one of the inventors of Kafka, wrote this excellent article going into detail about logs and the project’s development.

2 Confluent 是 Kafka 连接器的主要提供商。

2 Confluent is a major provider of Kafka connectors.

3 ZooKeeper 依赖项可能会在未来版本中删除。

3 The ZooKeeper dependency is likely to be removed in a future version.

4 Kafka 生产者将重试发送代理未确认的事件。这可能会导致事件的存储顺序与其最初生成的顺序不同。

4 Kafka producers will retry sending events that are not acknowledged by the broker. This may lead to events being stored in a different order from that in which they were originally produced.

5 为了完全避免这种复杂性,系统通常会稍微过度配置(例如 20%)主题的分区数量,以便您可以适应增长,而无需在部署后增加分区。

5 To avoid this complexity completely, it is common for systems to slightly overprovision (e.g., 20%) the number of partitions for a topic so you can accommodate growth without increasing partitions post-deployment.

6 Kafka再平衡是一个复杂的过程;Konstantine Karantasis 的这篇博文很好地描述了它的工作原理。

6 Kafka rebalancing is a complex process; this blog post by Konstantine Karantasis gives a good description of how it works.

第 15 章流处理系统

Chapter 15. Stream Processing Systems

时间就是金钱。您从数据中提取见解和知识的速度越快,您就能越快地响应系统所观察到的世界不断变化的状态。想想信用卡欺诈检测、捕获网络安全的异常网络流量、支持 GPS 的驾驶应用程序中的实时路线规划以及识别社交媒体网站上的热门话题。对于所有这些用例,速度至关重要。

Time is money. The faster you can extract insights and knowledge from your data, the more quickly you can respond to the changing state of the world your systems are observing. Think of credit card fraud detection, catching anomalous network traffic for cybersecurity, real-time route planning in GPS-enabled driving applications, and identifying trending topics on social media sites. For all of these use cases, speed is of the essence.

这些不同的应用程序都有一个共同的要求,即需要对最新的观测集执行计算。您是否关心当天早些时候发生了一起小事故,导致您平时的行驶路线积压了 3 个小时的交通,或者昨天暴风雪导致道路通宵封闭?只要您的驾驶应用程序告诉您高速公路畅通无阻,您就可以上路了。此类计算对时间敏感,需要访问最新数据才能相关。

These disparate applications have the common requirement of needing to perform computations on the most recent set of observations. Do you care if there was a minor accident that caused a 3-hour traffic backlog on your usual driving route earlier in the day, or that yesterday a snowstorm closed the road overnight? As long as your driving app tells you the highway is clear, you’re on the way. Such computations are time sensitive and need access to recent data to be relevant.

传统上,您通过将外部源中的数据持久保存到数据库中并设计可以提取所需信息的查询来构建此类应用程序。随着系统处理的信息到达率的增加,这变得越来越难做到。您需要数据库和索引具有快速、可扩展的写入性能,以实现最近数据点的低延迟聚合读取和联接。数据库写入和读取完成后,您终于准备好执行有用的分析。有时,“终于”是在漫长的等待之后出现的,在当今世界,迟到的结果——甚至晚了几秒钟——就和根本没有结果一样糟糕。

Traditionally, you build such applications by persisting data from external feeds into a database and devising queries that can extract the information you need. As the arrival rate of the information your systems process increases, this becomes progressively harder to do. You need fast, scalable write performance from your database, and indexes to achieve low latency aggregate reads and joins for recent data points. After the database writes and the reads complete, you are finally ready to perform useful analysis. Sometimes, “finally” comes after a long wait, and in today’s world, late results—even a few seconds late—are as bad as no results at all.

面对来自传感器、设备和用户的不断增长的海量数据源,我们已经看到一种称为流处理系统的新技术的出现。这些旨在为您提供在内存中处理数据流的功能,而无需保留数据以获得所需的结果。这通常称为动态数据或实时分析。流处理平台正在成为可扩展系统的常见部分。毫不奇怪,竞争激烈的技术环境为您提供了设计和部署系统的多种选择。

In the face of an ever-growing number of high-volume data sources from sensors, devices, and users, we’ve seen the emergence of a new class of technologies known as stream processing systems. These aim to provide you with the capabilities to process data streams in memory, without the need to persist the data to get the required results. This is often called data-in-motion, or real-time analytics. Stream processing platforms are becoming common parts of scalable systems. Not surprisingly, there’s a highly competitive technology landscape that gives you plenty of choice about how to design and deploy your systems.

在本章中,我将描述流处理平台的基本概念,以及它们支持的常见应用程序架构。然后,我将使用 Apache Flink 来说明这些概念,Apache Flink 是领先的开源流技术之一。

In this chapter I’ll describe the basic concepts of stream processing platforms, and the common application architectures they enable. I’ll then illustrate these concepts using Apache Flink, which is one of the leading open source streaming technologies.

流处理简介

Introduction to Stream Processing

自从软件系统出现以来,批处理在处理新可用数据方面发挥了重要作用。在批处理系统中,原始数据代表新的和更新的对象被累积到文件中。称为批处理数据加载作业的软件组件会定期处理这些新可用的数据并将其插入应用程序的数据库中。这通常称为提取、转换、加载 (ETL) 过程。ETL意味着包含新数据的批处理文件将被处理、聚合并将数据转换为适合插入存储层的格式。

Since the dawn of time in software systems, batch processing has played a major role in the processing of newly available data. In a batch processing system, raw data representing new and updated objects are accumulated into files. Periodically, a software component known as a batch data load job processes this newly available data and inserts it into the application’s databases. This is commonly known as an extract, transform, load (ETL) process. ETL means the batch files containing new data are processed, aggregating and transforming the data into a format amenable for insertion into your storage layer.

处理一批数据后,您的分析人员和外部用户就可以使用数据。您可以向数据库发起查询,从新插入的数据中产生有用的见解。该方案如图15-1所示。

Once a batch has been processed, the data is available to your analytics and external users. You can fire off queries to your databases that produce useful insights from the newly inserted data. This scheme is shown in Figure 15-1.

批处理的一个很好的例子是房地产网站。所有新挂牌、租赁和销售均从各种数据源累积到一个批次中。该批次定期应用于底层数据库,随后对用户可见。新信息还提供分析,例如每个地区每天有多少新房源,以及前一天的房屋销售情况。

A good example of batch processing is a real estate website. All new listings, rentals, and sales are accumulated from various data sources into a batch. This batch is applied periodically to the underlying databases and subsequently becomes visible to users. The new information also feeds analytics like how many new listings are available each day in each region, and how homes have sold in the previous day.

图 15-1。批量处理
Figure 15-1. Batch processing

批处理可靠、有效,是大型系统的重要组成部分。然而,缺点是新数据到达和可用于查询和分析之间存在时间滞后。一旦您积累了一批新数据(根据您的用例,这可能需要一个小时或一天),您必须等到:

Batch processing is reliable, effective, and a vital component of large-scale systems. The downside, however, is the time lag between new data arriving and it being available for querying and analysis. Once you have accumulated a new batch of data, which might take an hour or a day depending on your use case, you must wait until:

  • 您的 ETL 作业已完成将新数据摄取到存储库中

  • Your ETL job has finished ingesting the new data into your repository

  • 您的分析工作已完成

  • Your analysis job(s) complete(s)

从规模上看,整个过程的运行可能需要几分钟到几个小时。对于许多不需要绝对数据新鲜度的用例来说,这不是问题。如果您将房屋投放市场,即使您的房源在几个小时内没有出现在您最喜欢的房地产网站上,也不是世界末日。即使第二天也可以。但如果有人窃取了您的信用卡信息,等待长达 24 小时才能识别欺诈行为可能会让您的信用卡提供商损失大量金钱,并给每个人带来很多不便。对于此类用例,您需要流分析。

At scale, it can take anywhere from several minutes to several hours for this whole process to run. This is not a problem for many use cases where absolute data freshness is not required. If you put your home on the market, it’s not the end of the world if your listing doesn’t appear on your favorite real estate site for a few hours. Even the next day works. But if someone steals your credit card information, waiting up to 24 hours to identify the fraud can cost your credit card provider a lot of money, and everyone a lot of inconvenience. For such use cases, you need streaming analytics.

流式系统实时处理新数据和事件。当您进行信用卡购买时,信贷提供商可以利用流分析,通过欺诈检测模型运行您的交易。这将使用快速的统计模型预测技术(例如支持向量机)来评估交易是否可能存在欺诈。然后,系统可以立即标记并拒绝这些交易。在这种情况下,时间确实就是金钱。这里的"实时"高度依赖于应用程序,可能意味着从不到一秒到几秒的处理延迟。

Streaming systems process new data and events in real time. When you make a credit card purchase, the credit provider can utilize streaming analytics to run your transaction through a fraud detection model. This will use a fast statistical model prediction technique such as a support vector machine to evaluate whether a transaction is potentially fraudulent. The system can then flag and deny these transactions instantaneously. In this case, time really is money. “Real time” here is highly application dependent, and can mean processing latencies from less than a second to a few seconds.

流系统还可以处理批量或新数据窗口。这些有时称为微批次。例如,公共交通监控系统希望每 30 秒更新一次所有公交车的位置。公交车每隔几秒发送一次位置更新,并且这些更新作为流进行处理。流处理器聚合来自每条总线的所有更新。每 30 秒就会使用最新位置来更新交通客户在其应用程序上可见的位置。每辆公交车的一系列更新也可以被发送以进行进一步处理,以计算速度并预测路线上各个位置的到达时间。您可以在图 15-2中看到此类流系统的概览。

Streaming systems can also work on batches, or windows of new data. These are sometimes called microbatches. For example, a public transportation monitoring system wants to update the location of all buses every 30 seconds. Buses send location updates every few seconds, and these are processed as a stream. The stream processor aggregates all the updates from each bus. Every 30 seconds the latest location is used to update the location that is made visible to transportation customers on their app. The series of updates for each bus can also be sent for further processing to calculate speed and predict arrival times at locations on the route. You can see an overview of how such a streaming system looks in Figure 15-2.

图 15-2。流处理示例
Figure 15-2. Stream processing example
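
To make the microbatch idea concrete, here is a minimal sketch of the bus-location pipeline in Apache Flink, the platform covered later in this chapter. The BusUpdate type, BusUpdateSource, and LatestLocationSink are hypothetical application classes, not part of Flink's API.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class BusLocationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new BusUpdateSource())          // stream of position updates
           .keyBy(update -> update.busId)             // process each bus independently
           .window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
           .reduce((a, b) -> b.timestamp >= a.timestamp ? b : a) // keep newest position
           .addSink(new LatestLocationSink());        // publish to the customer app

        env.execute("bus-location-updates");
    }
}

Each 30-second tumbling window forms one microbatch per bus, and only the most recent position in the window is emitted downstream.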

批处理和流处理架构以及 Lambda 架构等混合架构(请参阅“Lambda 架构”)在现代可扩展系统中都占有一席之地。表 15-1总结了批处理和流处理方法,强调了它们的基本特征。

Both batch and stream processing architectures, as well as hybrids like the Lambda architecture (see “The Lambda Architecture”) have their place in modern scalable systems. Table 15-1 summarizes the batch and streaming approaches, highlighting their essential characteristics.

表 15-1。流处理和批处理的比较

特征 | 流处理 | 批量处理
批量大小 | 从单个事件到微批次,通常为数千到数万条 | 本质上无界,通常为数百万到数十亿条记录
延迟 | 亚秒到秒级 | 分钟到小时级
分析 | 针对新到达的数据在滚动时间窗口内进行相对简单的事件检测、事件聚合和指标计算 | 复杂,将新批次数据与现有数据合并

Table 15-1. Comparison of stream and batch processing

Characteristic | Stream processing | Batch processing
Batch size | Individual events to microbatches, typically thousands to tens of thousands in size | Essentially unbounded, commonly millions to billions of records
Latency | Subseconds to seconds | Minutes to hours
Analysis | Relatively simple event detection, aggregation, and metric calculation over rolling time intervals of newly arrived data | Complex, combining the new batch of data with existing data

流处理平台

Stream Processing Platforms

近年来,流处理平台激增。存在多种开源、专有和云提供商提供的解决方案,它们各有优缺点。然而,各平台的底层架构和机制是相似的。图 15-3 展示了流应用程序的基本剖析。

Stream processing platforms have proliferated in recent years. Multiple open source, proprietary, and cloud provider–supplied solutions exist, all with their own pros and cons. The underlying architecture and mechanisms across platforms are similar, however. Figure 15-3 illustrates the basic streaming application anatomy.

数据通过各种数据源提供给平台。通常,这些是队列(例如 Kafka 主题)或分布式存储系统(例如 S3)中的文件。流处理节点从数据源获取数据对象并执行转换、聚合和特定于应用程序的业务逻辑。节点被组织为有向无环图(DAG)。源自源的数据对象作为流进行处理。数据流是单个数据对象的无限序列。由于数据对象概念上在处理节点之间传递或流动,因此流应用程序也称为数据流系统。

Data is made available to the platforms through various data sources. Commonly, these are queues such as a Kafka topic, or files in distributed storage systems such as S3. Stream processing nodes ingest data objects from data sources and perform transformations, aggregations, and application-specific business logic. Nodes are organized as a directed acyclic graph (DAG). Data objects originating from the source are processed as a stream. A data stream is an unbounded sequence of individual data objects. As data objects conceptually are passed, or flow, between processing nodes, streaming applications are also known as dataflow systems.

流处理系统为处理节点提供了将一个节点处的输入流转换为由一个或多个下游节点处理的新流的能力。例如,您的交通应用程序可以每 30 秒从公交车位置更改事件流中生成当前公交车位置的新流。

Stream processing systems provide the capabilities for processing nodes to transform an input stream at one node into a new stream that is processed by one or more downstream nodes. For example, your transport application can produce a new stream of the current bus locations every 30 seconds from a stream of bus location change events.

图 15-3。通用流处理平台架构
Figure 15-3. Generic stream processing platform architecture

流处理应用程序有两种一般风格。第一个简单地处理和转换流中的各个事件,而不需要有关每个事件的任何上下文或状态。您可以输入来自可穿戴设备的最新数据更新流,并将各个数据对象转换为代表用户最新步数、心率和每小时活动数据的其他几个数据对象。结果被写入数据接收器,例如数据库或队列,以进行下游异步处理,计算静息心率、燃烧的卡路里等。

Stream processing applications have two general flavors. The first simply processes and transforms individual events in the stream, without requiring any context, or state, about each event. You might input a stream of the latest data updates from wearable devices and transform the individual data objects into several others representing the user’s latest step counts, heart rate, and hourly activity data. The results are written to data sinks such as a database or a queue for downstream asynchronous processing that calculates resting heart rate, calories burned, and so on.

相反,一些流应用程序需要维护在流中各个数据对象的处理过程中持续存在的状态。运输监控应用程序必须了解所有正在运行的公交车,并维护代表过去 30 秒位置更新的状态。欺诈检测应用程序必须维护表示识别可疑交易所需的当前模型参数的状态。零售商店流应用程序必须维护表示过去一小时内售出的每件商品数量的信息,以识别需求量较大的商品。这种类型的应用程序称为有状态流应用程序。

In contrast, some streaming applications need to maintain state that persists across the processing of individual data objects in the stream. The transport monitoring application must know about all the buses in motion and maintain state representing the position updates in the last 30 seconds. A fraud detection application must maintain state representing the current model parameters needed to identify suspicious transactions. A retail store streaming application must maintain information representing the number of each individual item sold in the last hour to identify goods in high demand. This flavor of applications is known as stateful streaming applications.
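
As a sketch of what this looks like in code, the following hypothetical Flink function implements a simplified variant of the retail example, maintaining a running count per item in keyed state (ignoring the hourly window for brevity). The ItemSale type is a placeholder, and the function would be applied after a keyBy on the item key so that each parallel instance owns the state for its share of items.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Maintains a running sales count per item key. Flink includes this
// state in its checkpoints, so the counts survive failures and restarts.
public class ItemSalesCounter
        extends RichFlatMapFunction<ItemSale, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("sales-count", Long.class));
    }

    @Override
    public void flatMap(ItemSale sale, Collector<Tuple2<String, Long>> out)
            throws Exception {
        Long current = count.value();  // null on the first event for this key
        long updated = (current == null ? 0L : current) + sale.quantity;
        count.update(updated);
        out.collect(Tuple2.of(sale.itemKey, updated));
    }
}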

最后,流处理平台需要能够使应用程序能够扩展其处理能力并能够适应故障。这通常是通过跨计算资源集群执行处理节点的多个实例并实现状态检查点机制以支持故障后恢复来实现的。如何实现这一点很大程度上依赖于平台。

Finally, stream processing platforms need capabilities to enable applications to scale out their processing and be resilient to failures. This is typically achieved by executing multiple instances of processing nodes across a cluster of computational resources, and implementing a state checkpointing mechanism to support recovery after failure. How this is achieved is extremely platform dependent.

作为扩展的示例,以下 Apache Storm 代码创建了一个流处理应用程序(在 Storm 中称为拓扑),其中包含单个数据源和排列为简单管道的两个处理节点:

As an example of scaling, the following Apache Storm code creates a stream processing application (called a topology in Storm) with a single data source and two processing nodes arranged as a simple pipeline:

TopologyBuilder builder = new TopologyBuilder();
// Connect the topology to the stream of purchase events
builder.setSpout("purchasesSpout", new PurchasesSpout());
// Route purchases with the same itemKey to the same bolt instance
builder.setBolt("totalsBolt", new PurchaseTotals(), numTotalsBolts)
        .fieldsGrouping("purchasesSpout", new Fields("itemKey"));
// Funnel all running totals into a single leaderboard bolt
builder.setBolt("topSellersBolt", new TopSellers())
        .globalGrouping("totalsBolt");

其工作原理如下。

It works as follows.

PurchasesSpout 对象将购买记录作为来自数据源的流发出。Storm 中的 spout 将流应用程序连接到数据源(例如队列)。

A PurchasesSpout object emits purchase records as a stream from a data source. A spout in Storm connects the streaming applications to a data source such as a queue.

购买事件流从 spout 传递到称为 bolt 的处理节点对象,即 PurchaseTotals 对象,它维护所有商品的购买总额。Storm 将该 bolt 的多个实例(数量由 numTotalsBolts 参数定义)作为独立线程执行。fieldsGrouping 确保具有相同 itemKey 值的购买总是从 spout 发送到同一个 bolt 实例,以便每个键的总数由单个 bolt 管理。

The stream of purchases is passed from the spout to a processing node object, known as a bolt. This is the PurchaseTotals object. It maintains purchase totals for all items. Multiple instances of the bolt, defined by the numTotalsBolts parameter, are executed by Storm as independent threads. The fieldsGrouping ensures that purchases with the same itemKey value are always sent from the spout to the same bolt instance so that the total for every key is managed by a single bolt.

PurchaseTotals bolt 将变化的购买总量以流的形式发送到 TopSellers bolt,后者创建流中最畅销商品的排行榜。globalGrouping 将所有 PurchaseTotals 实例的输出路由到单个 TopSellers bolt 实例。

The PurchaseTotals bolt sends a stream of changed total purchases to the TopSellers bolt. This creates a leaderboard of the best-selling items in the stream. The globalGrouping routes the output of all PurchaseTotals instances to a single TopSellers bolt instance.

Storm 的逻辑拓扑如图 15-4所示。根据部署拓扑的底层集群配置,Storm 将在一个或多个可用 JVM 中将指定数量的 Bolt 实例作为线程执行。这使得拓扑能够利用部署环境中可用的计算资源。

The logical Storm topology is depicted in Figure 15-4. Depending on the underlying cluster configuration that the topology is deployed on, Storm will execute the specified number of bolt instances as threads in one or more available JVMs. This enables topologies to take advantage of the computational resources available in the deployment environment.

图 15-4。Apache Storm 拓扑示例
Figure 15-4. Example Apache Storm topology
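
For completeness, here is a sketch of how such a topology might be launched, either in-process for development or submitted to a production cluster; the topology name and worker count are illustrative:

Config conf = new Config();
conf.setNumWorkers(4);  // JVMs across the cluster that host the spout/bolt threads

// Run inside the current JVM for development and testing...
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("purchases", conf, builder.createTopology());

// ...or deploy to a production Storm cluster
StormSubmitter.submitTopology("purchases", conf, builder.createTopology());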

Apache Storm 是一个强大且可扩展的流处理平台。然而,它的 API 相对简单,并将显式定义拓扑的责任交给了应用程序设计者。在本章的剩余部分,我将转而关注更现代的 Apache Flink,它提供了用于构建流应用程序的函数式编程 API。

Apache Storm is a powerful and scalable streaming platform. Its API is relatively simple, however, and places the responsibility for explicit topology definition on the application designer. In the remainder of this chapter, I’ll focus instead on the more contemporary Apache Flink, which provides functional programming APIs for building streaming applications.

结论和进一步阅读

Conclusions and Further Reading

流系统产生相关且及时的结果的能力在许多应用领域中非常有吸引力。您可以实时转换、聚合和分析传入数据。您的应用程序可以根据时间窗口或消息量对有限批次的数据执行分析。这使得识别数据趋势并根据最新数据窗口中的值计算指标成为可能。

The ability of streaming systems to produce relevant and timely results is highly attractive in many application domains. You can transform, aggregate, and analyze incoming data in real time. Your applications can perform analyses on finite batches of data based on time windows or message volumes. This makes it possible to identify trends in data and calculate metrics based on values in the most recent windows of data.

您可以利用众多流处理平台来构建容错且可扩展的应用程序。可扩展性是通过将逻辑数据流应用程序架构转换为物理等效架构来实现的,后者将系统中的处理节点分布并连接到集群中的计算资源上。容错机制持久化处理节点的状态,并跟踪哪些消息已通过完整的数据流应用程序成功处理。当发生故障时,流可以从第一条未完成的消息重新开始处理。

There are numerous streaming platforms that you can utilize to build fault-tolerant, scalable applications. Scalability is achieved by transforming your logical dataflow application architecture into a physical equivalent that distributes and connects processing nodes in the system across computational resources in a cluster. Fault tolerance mechanisms persist processing node state and track which messages have been successfully processed through the complete dataflow application. When failures occur, the streams can be restarted from the first outstanding message.

Tyler Akidau、Slava Chernyak 和 Reuven Lax 所著的《Streaming Systems: The What,Where,When,and How of Large Scale Data Processing》是一本涵盖流应用程序广泛设计和开发问题的好书(O'Reilly, 2018)。

A great book that covers the broad spectrum of design and development issues for streaming applications is Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing by Tyler Akidau, Slava Chernyak, and Reuven Lax (O’Reilly, 2018).

以下书籍是该领域许多领先竞争者的极好知识来源。其中包括 Apache Flink、Apache Storm、Kinesis、Apache Kafka Streams、Apache Spark Streams 和 Spring Cloud Data Flow:

The books below are excellent sources of knowledge for a number of the leading contenders in this space. These include Apache Flink, Apache Storm, Kinesis, Apache Kafka Streams, Apache Spark Streams, and Spring Cloud Data Flow:

  • Fabian Hueske 和 Vasiliki Kalavri,使用 Apache Flink 进行流处理:流应用程序的基础知识、实现和操作(O'Reilly,2019 年)

  • Fabian Hueske and Vasiliki Kalavri, Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications (O’Reilly, 2019)

  • Mitch Seymour,掌握 Kafka Streams 和 ksqlDB:通过示例构建实时数据系统(O'Reilly,2021 年)

  • Mitch Seymour, Mastering Kafka Streams and ksqlDB: Building Real-Time Data Systems by Example (O’Reilly, 2021)

  • Tarik Makota、Brian Maguire、Danny Gagne 和 Rajeev Chakrabarti,使用 Amazon Kinesis 实现可扩展数据流(Packt,2021)

  • Tarik Makota, Brian Maguire, Danny Gagne, and Rajeev Chakrabarti, Scalable Data Streaming with Amazon Kinesis (Packt, 2021)

  • Sean T. Allen、Matthew Jankowski 和 Peter Pathirana,Storm Applied:实时事件处理策略(Manning,2015 年)

  • Sean T. Allen, Matthew Jankowski, and Peter Pathirana, Storm Applied: Strategies for Real-Time Event Processing (Manning, 2015)

  • Gerard Maas 和 Francois Garillot,使用 Apache Spark 进行流处理:掌握结构化流和 Spark 流(O'Reilly,2019 年)

  • Gerard Maas and Francois Garillot, Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming (O’Reilly, 2019)

  • Felipe Gutierrez,Spring Cloud Data Flow:现代运行时微服务应用程序的本机云编排服务(Apress,2021 年)

  • Felipe Gutierrez, Spring Cloud Data Flow: Native Cloud Orchestration Services for Microservice Applications on Modern Runtimes (Apress, 2021)

1 该方法基于 Paris Carbone 等人的论文"分布式数据流的轻量级异步快照"。

1 The approach is based on the paper “Lightweight Asynchronous Snapshots for Distributed Dataflows” by Paris Carbone et al.

第 16 章成功的最后秘诀

Chapter 16. Final Tips for Success

让我们直言不讳吧。构建可扩展的分布式系统很难!

Let’s be blunt. Building scalable distributed systems is hard!

分布式系统本质上是复杂的,具有多种故障模式,您必须考虑这些故障模式,并进行设计以处理所有可能发生的情况。当您的应用程序因高请求量和快速增长的数据资源而承受压力时,事情会变得更加棘手。

Distributed systems by their very nature are complex, with multiple failure modes that you must take into consideration, and design to handle all eventualities. It gets even trickier when your applications are stressed by high request volumes and rapidly growing data resources.

大规模应用程序需要大量协作的硬件和软件组件,这些组件共同创造了实现低延迟和高吞吐量的能力。您面临的挑战是将所有这些移动部件组合成一个应用程序,该应用程序既能满足要求,又不会花费您大量的运行时间。

Applications at scale require numerous, cooperating hardware and software components that collectively create the capacity to achieve low latencies and high throughput. Your challenge is to compose all these moving parts into an application that satisfies requirements and doesn’t cost you the earth to run.

在本书中,我涵盖了作为可扩展分布式系统基础的原理、架构、机制和技术的广阔前景。有了这些知识,您就可以开始设计和构建大型应用程序。

In this book I’ve covered the broad landscape of principles, architectures, mechanisms, and technologies that are foundational to scalable distributed systems. Armed with this knowledge, you can start to design and build large-scale applications.

我想当你听到故事还没有结束时,你不会感到惊讶。我们都在新的应用需求和新的硬件和软件技术不断变化的环境中运作。虽然分布式系统的基本原理仍然成立(无论如何,在可预见的未来,量子物理学有一天可能会改变一切),但新的编程抽象、平台模型和硬件使您可以更轻松地构建更复杂的系统,提高性能、可扩展性、和韧性。推动我们穿越这一技术领域的隐喻火车永远不会减慢速度,而且可能只会变得更快。准备好迎接不断学习新东西的疯狂旅程。

I suspect that you will not be surprised to hear that this is not the end of the story. We all operate in an ever-changing landscape of new application requirements and new hardware and software technologies. While the underlying principles of distributed systems still hold (for the foreseeable future anyway—quantum physics might change things one day), new programming abstractions, platform models, and hardware make it easier for you to build more complex systems with increased performance, scalability, and resilience. The metaphorical train that propels us through this technology landscape will never slow down, and probably only get faster. Be prepared for a wild ride of constantly learning new stuff.

此外,成功的可扩展系统还有许多基本要素,我在本书中没有介绍这些要素。其中四个如图 16-1所示,我将在以下小节中简要描述每个问题的突出问题。

In addition, there are numerous essential ingredients for successful scalable systems that I have not covered in this book. Four of these are depicted in Figure 16-1, and I briefly describe the salient issues of each in the following subsections.

图 16-1。可扩展的分布式系统
Figure 16-1. Scalable distributed systems

自动化

Automation

工程师相当昂贵,却是构建大型系统时必不可少的资源。任何需要大规模部署的系统很快就会需要数百名才华横溢的工程师。以互联网巨头的规模来看,这个数字会增长到数千。然后,您的工程师需要能够针对不断增长的复杂代码库快速推出更改、修复和新功能。每天能够在不停机的情况下高效地将数百个更改推送到已部署的系统,是规模化的关键。您需要频繁部署更改,以改善客户体验并确保可靠且可扩展的运营。

Engineers are rather expensive but essential resources when building large-scale systems. Any system that needs to be deployed at scale is quickly going to require hundreds of talented engineers. At the scale of the internet giants, this number grows to many thousands. Your engineers then need to be able to rapidly roll out changes, fixes, and new features to growing, complex codebases. The ability to efficiently push hundreds of changes per day to a deployed system without downtime is key at scale. You need to deploy frequent changes to improve the client experience and ensure reliable and scalable operations.

自动化使开发人员能够快速且可靠地对运行中的系统进行更改。促进这种自动化的工具和实践体现在 DevOps 学科中。在《DevOps:软件架构师的视角》(O'Reilly,2015 年)中,Len Bass 等人将 DevOps 定义为"一组旨在缩短从提交系统更改到更改进入正常生产之间的时间,同时确保高质量的实践"。

Automation makes it possible for developers to rapidly and reliably make changes to operational systems. The set of tools and practices that facilitate such automation are embodied in the discipline of DevOps. In DevOps: A Software Architect's Perspective (O'Reilly, 2015), Len Bass et al. define DevOps as "a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality."

DevOps 包含一组实践以及基于开发和部署过程各个级别的自动化的工具。DevOps 的核心是持续交付 (CD) 实践,1由用于代码配置管理、自动化测试、部署和监控的复杂工具链提供支持。DevOps 通过使部署环境的管理成为开发团队的责任来扩展这些实践。这通常包括团队成员 24 小时轮流值班,以响应生产中的事件或故障。

DevOps encompasses a set of practices and tooling that are based on automation at all levels of the development and deployment process. At the heart of DevOps are continuous delivery (CD) practices,1 supported by sophisticated toolchains for code configuration management, automated testing, deployment, and monitoring. DevOps extends these practices by making the management of the deployment environment the responsibility of the development teams. This typically includes rotating 24-hour on-call responsibilities for team members to respond to incidents or failures in production.

DevOps 实践对于成功的可扩展系统至关重要。各团队负责设计、开发和运营自己的微服务,这些微服务通过定义良好的接口与系统的其余部分交互。借助自动化工具链,他们可以独立部署本地更改和新功能,而不会干扰系统运行。这减少了协调开销,提高了生产力,并促进了快速的发布周期。所有这些都意味着您的工程投入将获得更大的回报。

DevOps practices are essential for successful scalable systems. Teams have responsibilities for designing, developing, and operating their own microservices, which interact with the rest of the system through well-defined interfaces. With automated toolchains, they can independently deploy local changes and new features without perturbing the system operations. This reduces coordination overheads, increases productivity, and facilitates fast release cycles. All of which means you get a much bigger bang for your engineering dollars.

可观察性

Observability

“你无法管理你无法衡量的东西,”所以俗话说。在大型软件系统中,这确实是事实。由于存在大量移动部件,所有部件都在可变负载条件下运行,而且都容易出现不可预测的错误,因此您需要通过测量系统的运行状况和行为来获得见解。可观测性解决方案涵盖了这一系列的需求,包括:

“You can’t manage what you can’t measure,” so goes the saying. In large-scale software systems, this is indeed the truth. With multitudes of moving parts, all operating under variable load conditions and all unpredictably error-prone, you need insights gained through measurements on the health and behavior of your systems. An observability solution encompasses this spectrum of needs, including:

  • 根据不断生成的细粒度指标和日志数据捕获系统当前状态的基础设施

  • The infrastructure to capture a system’s current state based on constantly generated fine-grained metrics and log data

  • 分析聚合的实时指标、据此采取行动,并对指示实际或即将发生的故障的警报做出反应的能力

  • The capabilities to analyze and act on aggregated real-time metrics and react to alerts indicating actual or pending failures

可观察性的第一个基本要素是一个仪表系统,它不断以指标和日志条目的形式发出系统遥测数据。这种遥测的来源多种多样。它可以源自操作系统、您在应用程序中使用的基础平台(例如消息传递、数据库)以及您部署的应用程序代码。指标代表系统各个部分所提供的资源利用率以及延迟、响应时间和吞吐量。

The first essential element of observability is an instrumented system that constantly emits system telemetry in the form of metrics and log entries. The sources of this telemetry are many and varied. It can be sourced from operating systems, the foundational platforms (e.g., messaging, databases) you utilize in your applications, and the application code you deploy. Metrics represent resource utilizations and the latencies, response times, and throughput the various parts of your system are delivering.

代码插桩是必不可少的,您可以使用开源框架(例如 OpenTelemetry)或专有解决方案(例如 AWS CloudWatch)来发出特定于应用程序的指标。这些指标和日志条目形成基于时间序列的连续数据流,描述您的应用程序随时间变化的行为特征。

Code instrumentation is mandatory, and you can use open source frameworks (e.g., OpenTelemetry) or proprietary solutions (e.g., AWS CloudWatch) to emit application-specific metrics. These metrics and log entries form a continuous stream of time-series based data that characterizes your application behavior over time.
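
As an illustrative sketch, emitting an application-specific metric with the OpenTelemetry Java API might look like the following; the meter and counter names are placeholders, and the sketch assumes an OpenTelemetry SDK has been configured elsewhere in the application.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class RequestMetrics {
    // Obtain a meter from the globally registered OpenTelemetry SDK
    private static final Meter meter =
            GlobalOpenTelemetry.getMeter("example-service");

    // A counter tracking how many requests the service has processed
    private static final LongCounter requests = meter
            .counterBuilder("requests.processed")
            .setDescription("Total requests processed")
            .setUnit("1")
            .build();

    public static void recordRequest() {
        requests.add(1);  // the configured metric reader exports this periodically
    }
}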

捕获原始指标数据只是实现可观测性所带来的态势感知的先决条件。您需要快速处理该数据流,以便系统运营可以据此采取行动。这包括持续监控当前状态、探索历史数据以理解或诊断意外的系统行为,以及在超过阈值或发生故障时发送实时警报。您可以从许多支持监控和探索时间序列数据以实现可观测性的复杂解决方案中进行选择。Prometheus、Grafana 和 Graphite 是三种广泛使用的技术,它们为可观测性堆栈的各个部分提供开箱即用的解决方案。

Capturing raw metrics data is simply a prerequisite for the situational awareness that observability infers. You need to rapidly process this stream of data so that it becomes actionable for systems operations. This includes continuous monitoring of current state, exploring historical data to understand or diagnose unexpected system behavior, and sending real-time alerts when thresholds are exceeded or failures occur. You can choose from a number of sophisticated solutions that support monitoring and exploration of time-series data for observability. Prometheus, Grafana, and Graphite are three widely used technologies that provide out-of-the-box solutions for various parts of an observability stack.

可观测性是可扩展分布式系统的必要组成部分。忽视它,后果自负!要进一步了解可观测性,Charity Majors 等人所著的《Observability Engineering》(O'Reilly)一书是一个很好的来源。

Observability is a necessary component of scalable distributed systems. Ignore it at your peril! You’ll find a great source for learning more about observability is the book by Charity Majors et al., Observability Engineering (O’Reilly).

部署平台

Deployment Platforms

可扩展的系统需要广泛、有弹性且可靠的计算和数据平台。现代公共云和私有数据中心从墙壁到天花板都装满了硬件,您只需点击一两下鼠标即可完成配置。更好的是,配置可以使用专为运维设计的脚本语言自动调用。这称为基础设施即代码 (IaC),是 DevOps 的重要组成部分。

Scalable systems need extensive, elastic, and reliable compute and data platforms. Modern public clouds and private data centers are packed to the walls and ceilings with hardware you can provision with the click or two of a mouse. Even better, provisioning is invoked automatically using scripting languages designed for operations. This is known as infrastructure as code (IaC), an essential ingredient of DevOps.

虚拟机传统上是应用程序的部署单元。然而,在过去的几年里,基于容器技术的新的轻量级方法不断涌现,Docker就是一个突出的例子。容器映像可以将应用程序代码和依赖项打包到单个可部署单元中。当部署在容器引擎(例如 Docker 引擎)上时,容器作为独立进程运行,与其他容器共享主机操作系统。与虚拟机相比,容器消耗的资源要少得多,因此可以通过将多个容器打包在单个虚拟机上来更有效地利用硬件资源。

Virtual machines were traditionally the unit of deployment for applications. However, the last few years have seen the proliferation of new lighter-weight approaches based on container technologies, with Docker being the preeminent example. Container images enable the packaging of application code and dependencies into a single deployable unit. When deployed on a container engine such as the Docker Engine, containers run as isolated processes that share the host operating systems with other containers. Compared to virtual machines, containers consume considerably fewer resources, and hence make it possible to utilize hardware resources more efficiently by packing multiple containers on a single virtual machine.

容器通常与集群管理平台(例如 Kubernetes 或 Apache Mesos)配合使用。这些编排平台提供 API,供您控制容器的执行方式、时间和位置。它们使您可以使用自动扩展来自动部署容器以支持不同的系统负载,并简化跨集群中多个节点部署多个容器的管理。

Containers are typically utilized in concert with a cluster management platform such as Kubernetes or Apache Mesos. These orchestration platforms provide APIs for you to control how, when, and where your containers execute. They make it possible to automate your deployment of containers to support varying system loads using autoscaling and simplify the management of deploying multiple containers across multiple nodes in a cluster.

数据湖

Data Lakes

您多久会在最喜欢的社交媒体信息流中向回滚动,查找您 5 年前、10 年前甚至更久以前发布的照片?我敢打赌,不会太频繁。而且我敢打赌,您的好友们这样做的次数更少。如果您尝试一下,您可能会发现,一般来说,您回溯得越久远,照片渲染所需的时间就越长。

How often do you scroll back in time on your favorite social media feed to look for photos you posted 5, 10, or even more years ago? Not very often, I bet. And I bet your connections do it even less. If you give it a try, you’ll probably find, in general, that the further you go back in time, the longer your photos will take to render.

这是大规模历史数据管理所面临挑战的一个例子。随着时间的推移,您的系统将生成数 PB 甚至更多的数据。其中大部分数据很少被您的用户访问,甚至从未被访问。但出于应用程序领域所决定的原因(例如监管、合同或流行度),您需要为偶尔出现的请求保留可用的历史数据。

This is an example of the historical data management challenges faced at scale. Your systems will generate many petabytes or more of data over time. Much of this data is rarely, if ever accessed by your users. But for reasons that your application domain dictates (e.g., regulatory, contractual, popularity), you need to keep historical data available for the few occasions it is requested.

管理、组织和存储这些历史数据存储库是数据仓库、大数据和(最近的)数据湖的领域。虽然这些方法之间存在技术和哲学差异,但其本质是以可检索、查询和分析的形式存储历史数据。

Managing, organizing, and storing these historical data repositories is the domain of data warehousing, big data, and (more recently) data lakes. While there are technical and philosophical differences between these approaches, their essence is storage of historical data in a form it can be retrieved, queried, and analyzed.

数据湖的特点通常是以异构格式存储和编目数据,从原生 blob 到 JSON,再到关系数据库提取。它们利用低成本对象存储,例如 Apache Hadoop、Amazon S3 或 Microsoft Azure Data Lake。灵活的查询引擎支持对数据进行分析和转换。您还可以使用不同的存储类别(本质上是以更长的检索时间换取更低的价格)来优化成本。

Data lakes are usually characterized by storing and cataloging data in heterogeneous formats, from native blobs to JSON to relational database extracts. They leverage low-cost object storage such as Apache Hadoop, Amazon S3, or Microsoft Azure Data Lake. Flexible query engines support analysis and transformation of the data. You can also use different storage classes, essentially providing longer retrieval times for lower cost, to optimize your costs.

进一步阅读和结论

Further Reading and Conclusions

大规模设计、构建、操作和发展软件系统的内容比一本书所能涵盖的内容要多得多。本章简要描述了您在生产系统中需要了解和解决的可扩展系统的四个内在元素。将这些元素添加到现代软件架构师需要拥有的不断扩大的知识库中。

There’s a lot more to designing, building, operating, and evolving software systems at massive scale than can be covered in a single book. This chapter briefly describes four intrinsic elements of scalable systems that you need to be aware of and address in production systems. Add these elements to the ever-expanding palette of knowledge that modern software architects need to possess.

我将为您推荐一些我认为每个人(虚拟)书架上都应该有的书籍。

I’ll leave you with a couple of recommendations for books I think everyone should have on their (virtual) bookshelf.

首先是 Betsy Beyer 等人主编的经典书籍《Site Reliability Engineering: How Google Runs Production Systems》(O'Reilly),它描述了 Google 为运行其生产系统而开发的一组实践和工具。该书对保持大规模系统基础设施运行和健康所需的方法进行了广泛、彻底且跨领域的描述。

First, the classic book Site Reliability Engineering: How Google Runs Production Systems, edited by Betsy Beyer et al. (O’Reilly) describes the set of practices and tooling that Google developed to run their production systems. It is an extensive, thorough, and cross-cutting description of the approaches needed to keep massive-scale system infrastructures operating and healthy.

同样涵盖广泛知识的还有 Neal Ford 等人编写的《Software Architecture: The Hard Parts》(O'Reilly),书中充满了关于如何解决现代系统提出的许多设计难题的见解和示例。这些设计问题很少有简单、正确的解决方案。为此,作者描述了如何应用当代架构设计知识和权衡分析来达成令人满意的解决方案。

In a similar vein of wide-ranging knowledge, Software Architecture: The Hard Parts, by Neal Ford et al. (O'Reilly) is chock-full with insights and examples of how to address the many design conundrums that modern systems present. There are rarely, if ever, simple, correct solutions to these design problems. To this end, the authors describe how to apply contemporary architecture design knowledge and trade-off analysis to reach satisfactory solutions.

阅读愉快!

Happy reading!

1该领域的经典书籍是 Jez Humble 和 David Farley 的《持续交付:通过构建、测试和部署自动化实现可靠的软件发布》(Addison-Wesley Professional,2010 年)。

1 The classic book in this area is Jez Humble and David Farley’s Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (Addison-Wesley Professional, 2010).

关于作者

About the Author

Ian Gorton拥有 30 年的软件架构师、作家、计算机科学教授和顾问经验。他从研究生院开始就专注于分布式技术,并致力于银行、电信、政府、医疗保健以及科学建模和仿真等领域的大型软件系统的研究。在此期间,他目睹了软件系统发展到今天常规运行的大规模。

Ian Gorton has 30 years’ experience as a software architect, author, computer science professor, and consultant. He has focused on distributed technologies since his days in graduate school and has worked on large-scale software systems in areas such as banking, telecommunications, government, health care, and scientific modeling and simulation. During this time, he has seen software systems evolve to the massive scale they routinely operate at today.

Ian 撰写了三本书,包括《Essential Software Architecture》和《Data Intensive Computing》,并且是 200 多篇有关软件架构和软件工程的科学与专业出版物的作者。在卡内基梅隆软件工程学院,他领导了大数据和大规模可扩展系统的研发项目;自 2015 年加入东北大学担任计算机科学教授以来,他继续围绕这些主题开展工作、写作和演讲。他拥有英国谢菲尔德哈勒姆大学的博士学位,是 IEEE 计算机学会高级会员。

Ian has written three books, including Essential Software Architecture and Data Intensive Computing, and is the author of over 200 scientific and professional publications on software architecture and software engineering. At the Carnegie Mellon Software Engineering Institute, he led R&D projects in big data and massively scalable systems, and he has continued working, writing, and speaking on these topics since joining Northeastern University as a professor of computer science in 2015. He has a PhD from Sheffield Hallam University, UK, and is a senior member of the IEEE Computer Society.

后记

Colophon

《可扩展系统的基础》封面上的动物是暗色石斑鱼(Epinephelus marginatus),也称为黄腹岩鳕鱼或黄腹石斑鱼。它常见于地中海,其分布范围从伊比利亚半岛沿非洲海岸一直延伸到莫桑比克,以及从巴西到阿根廷北部。暗色石斑鱼通常出现在从海面到约 300 米深的岩石海域。它们是伏击捕食者,隐藏在岩石中,然后吸入猎物并将其整个吞下。

The animal on the cover of Foundations of Scalable Systems is a dusky grouper (Epinephelus marginatus), also known as the yellowbelly rock cod or yellowbelly grouper. It is common in the Mediterranean Sea, and its range stretches from the Iberian Peninsula along the coast of Africa to Mozambique and from Brazil to northern Argentina. Dusky groupers are normally found in rocky marine areas from the surface down to a depth of about 300 meters. They are ambush feeders, hiding among the rocks and then sucking in prey and swallowing it whole.

Like other groupers, they have large, oval bodies and wide mouths with protruding lower jaws. Dusky groupers have dark reddish-brown or grayish heads with yellow bellies and pale blotches on the head and body. They can reach up to five feet long and can weigh over a hundred pounds. All dusky groupers begin adult life as females and begin to breed at around five years of age, but they develop into males between their ninth and sixteenth years. They live up to 50 years in the wild.

The dusky grouper is a popular food fish, leading it to become a victim of overfishing. Although conservation efforts are being taken, the species is classified as vulnerable. Many of the animals on O’Reilly covers are endangered; all of them are important to the world.

封面插图由凯伦·蒙哥马利 (Karen Montgomery) 绘制,以约翰逊的《自然历史》中的古董线条雕刻为基础。封面字体为 Gilroy Semibold 和 Guardian Sans。文字字体为 Adobe Minion Pro;标题字体为 Adobe Myriad Condensed;代码字体是 Dalton Maag 的 Ubuntu Mono。

The cover illustration is by Karen Montgomery, based on an antique line engraving from Johnson’s Natural History. The cover fonts are Gilroy Semibold and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.